
Project in Computational Game Theory: Communities in Social Networks

Eldad Rubinstein

November 11, 2012

1 Presentation of the Original Paper

1.1 Introduction

In this section I present the article [1]. The article starts by presenting the term "community" in the context of sociology and graph theory. A community is described as a group of nodes that are more densely connected to each other than to the rest of the network.

When communities are not allowed to overlap, there are many successful approaches to recovering the community structure of a network, such as clustering. However, in real-life social networks communities usually overlap, so other approaches are needed. Solving the overlapping problem directly is not feasible, because it leads to NP-hard problems. Another possible method is to assume a generative model that explains how communities are created. However, developing such models requires reliable data about real-world community structure, which is exactly the problem we are trying to solve in the first place.

The presented paper addresses this problem by stating assumptions about the network (as opposed to a complete model) and by suggesting algorithms that try to recover the communities from the network under these assumptions. The assumptions are based on ego-centric analysis, a sociological approach that uses information about the ties of individuals to gain insight about the entire network.

1.2 Assumptions

The basic setup is as follows: the network is a graph on n nodes. Each community C is an arbitrary set of nodes, unknown to the algorithm. The following assumptions are made:

0. Each person participates in at most d communities, where d is constant or small.

1. Expected Degree Model. Each node u ∈ C has an affinity p_u ∈ [0, 1]. For u, v ∈ C, the edge (u, v) exists with probability e_{u,v} = p_u p_v.¹

2. Maximality with gap ε. Nodes outside the community C are less strongly connected to it than community nodes are. For example, if p_u = α for all u ∈ C, then any w ∉ C has edges to less than an (α − ε) fraction of the nodes in C.

3. Communities explain a γ fraction of each person's ties.

1.3 First Step: Communities are Cliques

In the following sections, another assumption is made: for each community C, δk ≤ |C| ≤ k, where δ > 0 is some constant and k is arbitrary but known to the algorithm.

First, we assume that all communities are cliques, that is, each node u has affinity p_u = 1. Under these assumptions the following theorem is proven:

Theorem 1. The Clique-Community-Find algorithm outputs each community with probability 2/3 in time O(nk/δγ) · 2^{Õ(log² d)}.

Algorithm 1 Clique-Community-Find (informal overview)
1: Pick starting nodes uniformly at random
2: For each starting node v, randomly sample S ⊆ Γ(v) by including each node with probability p. Proceed only if the sample is large enough.
3: for all cliques U of size at most 2pk in the induced graph G(S) on S do
4:   V ← {nodes in Γ(v) which are connected to all nodes in U}
5:   Return high-degree vertices of G(V)
6: end for

¹ Each node might belong to more than one community. In this case each node has a different affinity for each community it belongs to, and the edge (u, v) exists with probability at least the maximum of p_u p_v over all the communities that both u and v are in.
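To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the same steps. It is written against a plain dict-of-sets adjacency representation; the number of starting nodes, the sample-size check, and the 0.9 "high degree" cutoff are illustrative placeholders, not the constants used in the paper or in the implementation described in Section 4.

```python
import itertools
import random

def clique_community_find(adj, k, p, num_starts=20, seed=0):
    """adj: dict mapping each node to the set of its neighbors."""
    rng = random.Random(seed)
    nodes = list(adj)
    found = set()
    for _ in range(num_starts):
        v = rng.choice(nodes)                        # step 1: pick a starting node
        S = {u for u in adj[v] if rng.random() < p}  # step 2: subsample Gamma(v)
        if len(S) < p * k / 2:                       # "proceed only if the sample is large enough"
            continue
        max_size = max(1, int(2 * p * k))
        for r in range(1, max_size + 1):             # step 3: all cliques U of size <= 2pk in G(S)
            for U in itertools.combinations(sorted(S), r):
                if any(b not in adj[a] for a, b in itertools.combinations(U, 2)):
                    continue                         # U is not a clique
                # step 4: nodes of Gamma(v) adjacent to every node of U
                V = {w for w in adj[v] if all(w == u or w in adj[u] for u in U)}
                # step 5: keep the high-degree vertices of the induced graph G(V)
                C = {w for w in V if len(adj[w] & V) >= 0.9 * max(len(V) - 1, 1)}
                if C:
                    found.add(frozenset(C))
    return found
```

Because S is a small random sample of Γ(v), the brute-force clique enumeration stays manageable on toy inputs; a practical implementation would prune it further.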

1.4 Communities are Dense Subgraphs

In reality, communities are usually not cliques, so the algorithm presented in the previous section will not necessarily work. Instead, we assume that all affinity values are p_u = α (and even if two nodes u, v belong to more than one community, e_{u,v} = α). Using a similar algorithm, it can be proven that:

Theorem 2. Under the described assumptions (p_u = α), each community can be found with high probability over the graph randomness and with probability 2/3 over the algorithm's randomness, in time O(n(k/γαδ)) · 2^{Õ(log² d)}.

The assumption regarding α can be relaxed even further in order to fit real-life communities, so now we assume p_u ≥ α (for a fixed α).

Theorem 3. Under the described assumptions (p_u ≥ α), each community can be found with high probability over the graph randomness and with probability 2/3 over the algorithm's randomness, in time O(n(k/γαδ)^T) · 2^{Õ(log² d)}.

The reason for the worse running time is the different algorithm used in this case. The previous algorithms might not work here, because Γ(v) ∩ C is no longer a uniform subset of C. Therefore, in order to prove Theorem 3, the algorithm has to loop over all sets of nodes S ⊆ Γ(v) of a certain size T for each sampled node v.

1.5 Communities with Very Different Sizes

We would like to drop the assumption about community sizes that was made in Section 1.3 and allow communities of any size. However, sampling might miss some of the small communities. The previous algorithms will still work if they take enough samples, but then they will not be efficient (1/δ will be too large). Instead, a different algorithm is presented. This algorithm uses the following definition:

Definition. For α, ε > 0, an (α, α − ε)-set is a subset of nodes such that 1) every node in the set has edges to at least an α fraction of the nodes in the set; and 2) every outside node has edges to at most an (α − ε) fraction of the nodes in the set.

Theorem 4. Under the base assumptions (0-3), if for all communities p_u ≥ α_min and every community C is an (α_C − ε/8, α_C − 7ε/8)-set, then the Any-Size-Dense-Community-Find algorithm will output all communities in time n^{O(log(kd/γ) / (α_min ε²))}.²

² k is still used to denote the size of the largest community.
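Before moving on to the algorithm itself, here is a small Python helper that checks this definition directly, using the same dict-of-sets adjacency as the Clique-Community-Find sketch above. Excluding a member's own slot from the denominator (the size − 1 below) is a choice made here for illustration.

```python
def is_alpha_beta_set(adj, S, alpha, beta):
    """Check the (alpha, beta)-set condition: members have edges to at least an
    alpha fraction of S, non-members to at most a beta fraction of S."""
    S = set(S)
    size = len(S)
    if size == 0:
        return False
    for v in adj:
        inside = len(adj[v] & S)
        if v in S:
            # condition 1: dense enough inside the set (self excluded from the count)
            if inside < alpha * (size - 1):
                return False
        elif inside > beta * size:
            # condition 2: outside nodes are only weakly attached
            return False
    return True
```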

Algorithm 2 Any-Size-Dense-Community-Find
1: T = 100 log(kd/γ) / (α_min ε²)
2: for α = 1 downto α_min, step ε/4 do
3:   for all sets of nodes S of size T do
4:     U ← {v : at least an (α − ε/4) fraction of its edges go to S}
5:     Return U if it is an (α, α − ε/2)-set
6:   end for
7: end for

The Any-Size-Dense-Community-Find algorithm does solve the problem of communities with very different sizes; however, it iterates over many sets, and therefore its running time is not polynomial.
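For concreteness, the loop structure of Algorithm 2 can be transcribed almost line by line into Python, reusing the is_alpha_beta_set helper sketched above. The transcription is purely illustrative: it enumerates all node sets of size T, so it is only usable on tiny graphs.

```python
import itertools
import math

def any_size_dense_community_find(adj, k, d, gamma, eps, alpha_min):
    # line 1: T = 100 log(kd/gamma) / (alpha_min * eps^2)
    T = int(math.ceil(100 * math.log(k * d / gamma) / (alpha_min * eps ** 2)))
    communities = []
    alpha = 1.0
    while alpha >= alpha_min:                        # line 2: alpha = 1 downto alpha_min, step eps/4
        for S in itertools.combinations(adj, T):     # line 3: all node sets of size T
            S = set(S)
            # line 4: nodes sending at least an (alpha - eps/4) fraction of their edges to S
            U = {v for v in adj
                 if adj[v] and len(adj[v] & S) >= (alpha - eps / 4) * len(adj[v])}
            # line 5: keep U if it is an (alpha, alpha - eps/2)-set
            if is_alpha_beta_set(adj, U, alpha, alpha - eps / 2):
                communities.append(frozenset(U))
        alpha -= eps / 4
    return communities
```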

The paper presents another algorithm, which finds cliques of very different sizes. It needs the following additional assumptions:

3'. A stronger variation of the original Assumption 3: every community a node v belongs to has size at least γ/d times the number of ambient edges incident on v. That is, small communities are distinguishable from noise edges.

4. Distinctness. For u ∈ C, at least a constant fraction of the community C does not lie in any other community containing u.

5. Completeness ("Duck" Assumption). Any set that satisfies all the assumptions of a community in the model is a community. This ensures that Assumption 4 cannot be satisfied by pretending that a certain set is not a community.

Theorem 5. Given a graph G that satisfies all presented assumptions, cliques of different sizes can be found in it with high probability in polynomial time.

The algorithm required to prove this theorem works in the following way: it finds large cliques first using sampling, then ignores their edges and continues to smaller cliques. The extra assumptions ensure that these small communities can be found.

1.6 Relaxing the Assumptions

The presented paper states that Theorems 3 and 4 still hold even if Assumption 1 (the Expected Degree Model) is relaxed, provided that the following quantities are concentrated around their expectations:

- The number of edges from any node u to any community C (that is, |Γ(u) ∩ C|)
- The degree of each node
- The number of common neighbors of two nodes u, v inside a community C (that is, |Γ(u) ∩ Γ(v) ∩ C|)

Another assumption that can be (independently) relaxed is Assumption 2, the Gap Assumption. The paper shows that even if this assumption does not hold, as may be the case for real-life graphs, the Clique-Community-Find algorithm (see Section 1.3) will still return sensible answers:

Theorem 6. Let G be a graph that satisfies all assumptions of Theorem 1 other than the Gap Assumption. An adapted version of the Clique-Community-Find algorithm will be able to find the communities in G, but for every community C the algorithm will return a similar set C′ (and not C itself).

The paper states that the Gap Assumption can also be relaxed in a similar way in the setting of Theorem 2 (where p_u = α).

1.7 Sparser Communities

Ideas related to those shown in the previous sections can be used to identify sparse communities, in which the probability that two community nodes are connected is smaller. A different set of assumptions is used here:

0. Assumption 1 (the Expected Degree Model) holds.
1. The edge (u, v) exists with probability B/√k, where |C| = k for all communities and B > 10 is a constant.
2. All edges belong to some community.
3. Community intersections are small: |C ∩ C′| ≤ k/(20d²).³

In order to find communities under these assumptions, we transform the sparse graph G into a dense graph G′ that fulfills the assumptions of Theorem 3, which deals with dense communities of similar sizes.

Theorem 7. Let a graph G and a set of communities C be consistent with the sparse model. Construct a graph G′ on the same set of nodes, where u and v have an edge in G′ if and only if they have at least B²/2 length-2 paths in G. Then the pair (G′, C) is consistent with the assumptions of Theorem 3 with parameters (n, k, d, α, δ, ε, γ) = (n, k, d, 0.9, 1, 0.6, 1/(3d)).

³ Recall that d is the maximum number of communities in which each person can participate.
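The graph transformation in Theorem 7 is easy to state in code: connect u and v in G′ whenever they share at least B²/2 common neighbors (equivalently, length-2 paths) in G. The sketch below uses the same dict-of-sets representation as before and is only meant to illustrate the construction.

```python
def densify(adj, B):
    """Build G' from the sparse graph G (Theorem 7): u ~ v in G' iff u and v
    have at least B^2 / 2 length-2 paths in G."""
    nodes = list(adj)
    threshold = B * B / 2.0
    dense = {v: set() for v in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            # common neighbors of u and v = internal vertices of u-w-v paths
            if len(adj[u] & adj[v]) >= threshold:
                dense[u].add(v)
                dense[v].add(u)
    return dense
```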

Table 1: Summary of the Community Finding Algorithms

case no. | extra / different assumptions? | edge probability within communities | community sizes must be similar? | running time
1.       | No                             | Cliques                             | Yes                              | Polynomial
2.       | No                             | p_u = α                             | Yes                              | Polynomial
3.       | No                             | p_u ≥ α                             | Yes                              | Polynomial
4.       | No                             | p_u ≥ α                             | No                               | Quasi-polynomial
5.       | Extra                          | Cliques                             | No                               | Polynomial
6.       | Different                      | Sparse                              | Yes                              | Polynomial

2 Open Questions

One can think of several areas of possible further research following the article:

- Relaxing the base assumptions in more cases, especially the Expected Degree Model Assumption and the Gap Assumption.
- A polynomial algorithm for dense communities with different sizes. The paper presents two algorithms for communities with very different sizes: the first is not polynomial, and the second adds more restrictions. The question is whether some of these restrictions can be relaxed.
- Implementation of the suggested algorithms. This might involve using suitable heuristics in order to find cliques or dense subgraphs.
- Testing on real-world data. The questions are whether such data satisfies the assumptions, and what the running time would be (possibly using heuristics).
- Adapting the algorithms to a dynamic setting. For instance, a model in which there is an initial graph that is built according to the assumptions, and then each node connects to each neighbor of a neighbor with some probability.

3 Understanding the Proof of Theorem 1

In this section we try to fill in some of the missing details in the proof of Theorem 1.⁴ The proof of this theorem states that, with high probability, the subsampling gives a sample S of size at most three times its expectation. We will try to bound that probability. The paper hints that the Chernoff Bound should be used. The Chernoff Bound is a well-known probabilistic inequality that can be stated as follows:

Let X_1, ..., X_n be independent random variables. They need not have the same distribution. Assume that 0 ≤ X_i ≤ 1 for each i. Let X = X_1 + ... + X_n and µ = E[X] = E[X_1] + ... + E[X_n]. Then for any ε′ ≥ 0,

Pr[X ≥ (1 + ε′)µ] ≤ exp(−ε′²µ / (2 + ε′))  and  Pr[X ≤ (1 − ε′)µ] ≤ exp(−ε′²µ / 2).

In our case, we define an indicator variable X_i for each v_i ∈ Γ(v): X_i = 1 if v_i ∈ S and X_i = 0 if v_i ∉ S. Therefore,

E[|S|] = E[X_1 + ... + X_n] = E[X] = E[X_1] + ... + E[X_n] = µ.

Using the Chernoff Bound (we choose ε′ = 2),

Pr[|S| ≥ 3 E[|S|]] ≤ exp(−E[|S|]) = exp(−(deg(v)/(δk)) · log(12d/(εδγ))).

It is known that:

- d ≥ 1, because d denotes the maximum number of communities in which each person can participate.
- 0 < ε, δ, γ < 1, because ε, δ, γ all represent fractions.
- deg(v) ≥ δk, because δk is the minimum community size, v is a member of at least one community, and all communities are cliques in this setting.

Therefore, assuming log is the natural logarithm,

Pr[|S| ≥ 3 E[|S|]] ≤ exp(−log(12)) = 1/12.

⁴ It is numbered Theorem 1 in the original paper as well.
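The computation above can be sanity-checked numerically: sample each of deg(v) neighbors independently with probability p and estimate how often the sample exceeds three times its expectation. The values of deg_v and p below are made-up illustrative numbers, not parameters taken from the paper.

```python
import math
import random

def tail_estimate(deg_v=200, p=0.05, trials=100_000, seed=1):
    """Estimate Pr[|S| >= 3 E[|S|]] when each neighbor is kept with probability p."""
    rng = random.Random(seed)
    mu = p * deg_v                                   # E[|S|]
    hits = sum(
        sum(rng.random() < p for _ in range(deg_v)) >= 3 * mu
        for _ in range(trials)
    )
    return hits / trials

if __name__ == "__main__":
    print("empirical tail    :", tail_estimate())
    print("Chernoff, exp(-mu):", math.exp(-0.05 * 200))
```

With these illustrative values µ = 10 > log 12, so both the empirical estimate and the bound exp(−µ) come out far below 1/12, consistent with the derivation above.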

4 Implementing Clique-Community-Find

We would like to make a first step towards testing the suggested algorithms, which is to implement the Clique-Community-Find algorithm (see Section 1.3) and test it on toy data. The source code (in Python) is attached to the project. Two toy data graphs are used (see Figures 1 and 2):

1. 100 nodes forming 3 non-overlapping cliques. In addition, random (noise) edges are added with probability 0.01 per edge.
2. Similar to the previous graph, except that 10 nodes in each clique overlap.

The implemented algorithm performs well on both data sets and consistently recovers the exact cliques using the following parameters: k = 100, δ = 0.3, ε = 0.3, γ = 0.8, d = 4.
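The toy graphs described above can be reproduced with a short generator along the following lines. This is a sketch written from the description in this section, not the attached source code itself; in particular, the layout of the overlapping version (consecutive cliques sharing `overlap` nodes) is an assumption made here for illustration.

```python
import itertools
import random

def toy_graph(n=100, num_cliques=3, overlap=0, noise_p=0.01, seed=0):
    """Cliques on n nodes plus random noise edges; overlap > 0 makes
    consecutive cliques share that many nodes (an illustrative layout)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}

    def add_edge(u, v):
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    size = n // num_cliques
    cliques = []
    for i in range(num_cliques):
        start = i * (size - overlap)
        members = list(range(start, min(start + size, n)))
        cliques.append(members)
        for u, v in itertools.combinations(members, 2):
            add_edge(u, v)

    # noise edges with probability noise_p per node pair
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < noise_p:
            add_edge(u, v)
    return adj, cliques

# adj1, cliques1 = toy_graph()             # Figure 1: disjoint cliques
# adj2, cliques2 = toy_graph(overlap=10)   # Figure 2: cliques overlapping in 10 nodes
```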

Figure 1: Toy Data without Overlapping Cliques

Figure 2: Toy Data with Overlapping Cliques

References

[1] Sanjeev Arora, Rong Ge, Sushant Sachdeva, and Grant Schoenebeck. Finding overlapping communities in social networks: toward a rigorous approach. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC '12, pages 37-54, New York, NY, USA, 2012. ACM.