Cohesive subgroups. MGT 780 Spring 2014 Steve Borgatti


1 Cohesive subgroups MGT 780 Spring 2014 Steve Borgatti


2 What are cohesive subgroups?
- Loosely speaking, network clumps
- Regions of high density and low distance
- Initially conceived of as formalizations of fundamental sociological concepts
  - Primary groups
  - Emergent groups
- Now typically thought of in terms of a technique for identifying groups within networks

3 Terminology
- Groups are sociological constructs
  - Typically assumed to be recognized by their members
  - A sense of legitimate membership (boundaries)
  - Some sense of voluntarity
  - Internal cohesion
- Classes are categories of actors, such as socio-economic classes or genders
  - Defined by similarity rather than cohesion
- Clusters are mathematical objects
  - Sets of similar books, being inanimate objects, form clusters rather than groups
- Communities are what physicists call clusters that are intended to correspond to groups
- None of these distinctions is hard and fast
8 March 2013. Copyright (c) 2013 by Steve Borgatti. All rights reserved.

5 Partitions and classes
- A partition is an exhaustive division of items into mutually exclusive classes
- Not all subgroup methods yield partitions
- Partitions are analytically convenient, but real groups can be overlapping
[Figure: network whose partition has three classes, identified by color]

6 Key output of subgroup analysis
- Group membership: a node-level variable that indicates, for each node in the network, which group it belongs to (if any)
- Used as an independent or dependent variable in subsequent analysis
[Table: columns Node, Gender, Role, Group for actors HOLLY, BRAZEY, CAROL, PAM, MICHAEL, BILL, HARRY, GERY, STEVE, BERT, RUSS; cell values not recoverable]

7 Why we care
- They exist!
  - Real networks are often clumpy. Clumps are subgroups
  - Number of groups can predict ability of network to accomplish something as a whole
- Need for data reduction
  - Can analyze each group separately
  - Can aggregate subgroups into supernodes
  - Build simplified model of complex network (blockmodeling)
- They affect social processes we care about
  - Groups are powerful influencers/constrainers of members
  - Create homogeneity with respect to attitudes and behaviors, i.e., contagion
  - Identify roles such as inner group member or boundary spanner
- They are the outcomes of social processes we care about
  - Emergent patterns from local rules (e.g., balance theory)

8 Canonical Hypothesis
- Members of a group will have similar outcomes (ideas, attitudes, illnesses, behaviors)
- Due to interpersonal transmission
  - Transference
  - Influence / persuasion
  - Co-construction of beliefs & practices
    - As in communities of practice

9 Network Regions
- Weakly cohesive sections of a network that may contain subgroups
  - Subgroups don't span different regions; they are wholly within regions
- Components
  - Maximal sets of nodes that are mutually reachable
  - Weak and strong
- K-cores
  - Maximal sets of nodes such that each member is connected to at least k other members of the k-core

10 K-cores
- Areas in which each actor is connected to at least k other actors, all of whom have degree at least k within the area
- Not cohesive in themselves, but cohesive groups are contained within k-cores
- Finds large regions within which cohesive subgroups may be found
- Identifies fault lines across which cohesive subgroups do not span
- Useful for large datasets
[Figure: example network with its 2-core and 3-core marked]

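11 [Figure: network colored by core membership] Black nodes are in the 1-core only; red are in the 1- and 2-cores only; blue are in the 1-, 2- and 3-cores; grey are in the 1-, 2-, 3- and 4-cores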

12 K-core: more technical definition
- A maximal set of nodes S such that for all u in S, α(u,S) ≥ k
- α(u,S) is the number of ties node u has with members of set S
- In the example figure, G as a whole is a 1-core and a 2-core; {1..8} is a 3-core
- There is no 4-core or higher
[Figure: example network with its 2-core and 3-core marked]
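The definition above can be turned into a pruning procedure: repeatedly drop any node with fewer than k ties to the nodes that remain. The sketch below is illustrative only; the function name `k_core` and the six-node graph are assumptions made for the example, not the network shown on the slides.

```python
def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    adj maps each node to the set of its neighbours (hypothetical input format).
    """
    nodes = set(adj)
    changed = True
    while changed:
        changed = False
        for u in list(nodes):
            # alpha(u, S): ties from u to nodes still in the candidate set
            if len(adj[u] & nodes) < k:
                nodes.discard(u)
                changed = True
    return nodes


# Hypothetical graph: a 4-clique {1,2,3,4}, node 5 tied to 1 and 2, node 6 tied to 5.
adj = {
    1: {2, 3, 4, 5},
    2: {1, 3, 4, 5},
    3: {1, 2, 4},
    4: {1, 2, 3},
    5: {1, 2, 6},
    6: {5},
}
cores = {k: k_core(adj, k) for k in range(1, 5)}
```

In this example the whole graph is the 1-core, dropping node 6 gives the 2-core, the 4-clique is the 3-core, and the 4-core is empty, so the cores are nested as the slide describes.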

13 Components
- A subgraph S of a graph G is a component if S is maximal and connected
- If G is a digraph, then
  - S is a weak component if it is a component of the underlying (undirected) graph
  - S is a strong component if S is maximal and, for all u,v in S, there is a directed path from u to v
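As a rough illustration of the weak/strong distinction, the sketch below finds both kinds of component in a small digraph. It uses plain reachability searches rather than an optimized algorithm; the function names and the four-node digraph are assumptions made for the example.

```python
from collections import deque


def reachable(adj, start):
    """Set of nodes reachable from start by following adj (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def strong_components(nodes, arcs):
    """Maximal sets in which every node reaches every other along directed paths."""
    fwd = {n: set() for n in nodes}
    bwd = {n: set() for n in nodes}
    for u, v in arcs:
        fwd[u].add(v)
        bwd[v].add(u)
    comps, assigned = [], set()
    for n in nodes:
        if n not in assigned:
            # nodes that n can reach AND that can reach n
            comp = reachable(fwd, n) & reachable(bwd, n)
            comps.append(frozenset(comp))
            assigned |= comp
    return comps


def weak_components(nodes, arcs):
    """Components of the underlying undirected graph."""
    und = {n: set() for n in nodes}
    for u, v in arcs:
        und[u].add(v)
        und[v].add(u)
    comps, assigned = [], set()
    for n in nodes:
        if n not in assigned:
            comp = reachable(und, n)
            comps.append(frozenset(comp))
            assigned |= comp
    return comps


# Hypothetical digraph: a directed 3-cycle a->b->c->a plus an arc c->d.
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
strong = strong_components(nodes, arcs)
weak = weak_components(nodes, arcs)
```

Here the digraph is a single weak component, but d forms its own strong component because nothing leads back from d to the cycle.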

14 Notes on Components
- Finding components is often the first step in the analysis of large graphs
- Often analyze each component separately, or discard very small components

15 Two approaches to subgroups
- Means: algorithmic / process / evolutionary
  - Define a process/recipe/algorithm for extracting groups
  - Most methods drawn from multivariate stats are of this type
- Ends: formal / mathematical / ideal type
  - Define what you mean by a group, then leave it up to the researcher to construct an algorithm to find them
- Hybrids
  - Have a formal definition of an ideal clustering
  - Are defined by the (heuristic) algorithm used to get there

16 Or, four approaches to subgroups
- Network analysis
  - Outcome: clique, relaxed by
    - Distance: n-clique, n-clan, n-club
    - Density: k-core, k-plex, ls-set, lambda set, component
  - Process: Newman-Girvan, Negopy
- Multivariate stats
  - Process: Hiclus, Kmeans, Newman
  - Outcome: factions, comb opt

17 Typology of Subgroups
- Process
  - Network / graph theory: Newman-Girvan
  - Multivariate / clustering: Johnson's hierarchical clustering; k-means; scaling
- Outcome
  - Network / graph theory: clique, n-clique, n-clan, n-club, k-plex, ls-set, lambda-set, k-core
  - Multivariate / clustering: combinatorial optimization

18 From multivariate statistics: Johnson's Hierarchical Clustering
- Output is a set of nested partitions, starting with the identity partition (every node in its own cluster) and ending with the complete partition (all nodes in one cluster)
- Each partition is represented as a vector that associates each node with one and only one group (mutually exclusive)
- Different flavors based on how distance from a cluster to an outside point/node is defined
  - Single linkage; connectedness; minimum
  - Complete linkage; diameter; maximum
  - Average, median, etc.
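A minimal sketch of the agglomerative procedure may help make the linkage flavors concrete. The function name `hiclus`, its input format, and the four-item distance table are assumptions for illustration; this is not the UCINET implementation, and the distances are made up.

```python
def hiclus(dist, linkage=min):
    """Agglomerative clustering on a distance table.

    dist maps frozenset({a, b}) to the distance between items a and b.
    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns a list of (level, partition) pairs, from identity to complete partition.
    """
    items = sorted({x for pair in dist for x in pair})
    clusters = [frozenset([x]) for x in items]
    history = [(0, list(clusters))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # cluster-to-cluster distance under the chosen linkage rule
                d = linkage(dist[frozenset((a, b))]
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        history.append((d, list(clusters)))
    return history


# Hypothetical distances among four items.
raw = {("A", "B"): 1, ("A", "C"): 4, ("B", "C"): 2,
       ("A", "D"): 9, ("B", "D"): 8, ("C", "D"): 3}
dist = {frozenset(pair): d for pair, d in raw.items()}

single = hiclus(dist, linkage=min)
complete = hiclus(dist, linkage=max)
single_levels = [level for level, _ in single]
complete_levels = [level for level, _ in complete]
```

With these made-up distances, single linkage chains A, B, C and D together at levels 1, 2 and 3, while complete linkage pairs A with B and C with D before joining the two pairs at level 9.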

19 Example: clustering nine U.S. cities (BOS, NY, DC, MIA, CHI, SEA, SF, LA, DEN) by distance
[Figure on slides 19-26: two-dimensional plot of the cities alongside the city-by-city distance matrix; matrix values not recoverable]
Closest distance is NY-BOS = 206, so merge these.

20 Closest pair is DC to BOS/NY = 233, so merge these.

21 Clusters: BOS/NY/DC, MIA, CHI, SEA, SF, LA, DEN

22 Clusters: BOS/NY/DC, MIA, CHI, SEA, SF/LA, DEN

23 Clusters: BOS/NY/DC/CHI, MIA, SEA, SF/LA, DEN

24 Clusters: BOS/NY/DC/CHI, MIA, SF/LA/SEA, DEN

25 Clusters: BOS/NY/DC/CHI/DEN, MIA, SF/LA/SEA

26 Clusters: BOS/NY/DC/CHI/DEN/SF/LA/SEA, MIA. Done!

27 Applying HiClus to Network Data
- Wants symmetric data
- Doesn't work well with a matrix of 0s and 1s: not enough variation to play with
- Compute geodesic distances first, then cluster the distance matrix
- Or cluster some other valued matrix derived from the network
  - E.g., friends in common
[Table: geodesic distances among the 18 actors HOLLY, BRAZEY, CAROL, PAM, PAT, JENNIE, PAULINE, ANN, MICHAEL, BILL, LEE, DON, JOHN, HARRY, GERY, STEVE, BERT, RUSS; values not recoverable]
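The first step, converting a 1/0 network into a valued geodesic distance matrix, can be sketched with a breadth-first search from every node. The function name and the four-node path graph are assumptions for illustration; unreachable pairs are simply left out of the result.

```python
from collections import deque


def geodesic_distances(adj):
    """All-pairs shortest path lengths for an unweighted graph.

    adj maps each node to the set of its neighbours.
    Returns dist[u][v]; pairs with no connecting path are omitted.
    """
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[source] = d
    return dist


# Hypothetical path graph a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
dist = geodesic_distances(adj)
```

The resulting matrix has more variation than the original 0/1 adjacency matrix (values 1, 2 and 3 here), which is what a distance-based clustering routine needs.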


28 Hiclus of geo distances
[Dendrogram: hierarchical clustering of the geodesic distance matrix for the 18 actors, one column per actor and one row per level; level values not recoverable]

29 Measuring cluster adequacy
- Several off-the-shelf measures available
- All implicitly assume a concept of what subgroups are or what an optimal partition is
  - E.g., more ties within group than between groups
- Within this criterion, multiple measures
  - Newman Q modularity
  - Eta coefficient (analysis of variance)
  - Krackhardt & Stern E-I index
  - 12,000 others
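As one concrete instance of the "more ties within than between" criterion, the E-I index is commonly computed as (E - I) / (E + I), where E counts ties between groups and I counts ties within groups. The sketch below assumes an undirected edge list and a node-to-group mapping; the function name and example network are made up for illustration.

```python
def ei_index(edges, group):
    """Krackhardt & Stern E-I index: (external - internal) / (external + internal)."""
    external = sum(1 for u, v in edges if group[u] != group[v])
    internal = len(edges) - external
    return (external - internal) / (external + internal)


# Hypothetical network: two triangles joined by a single bridging tie (3-4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
group = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
ei = ei_index(edges, group)  # 1 external tie, 6 internal ties
```

The index runs from -1 (all ties internal) to +1 (all ties external), so strongly negative values indicate a partition with cohesive groups.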

30 Variations on hierarchical clustering
- Different flavors based on how distance from a cluster to an outside point/node is defined
  - Single linkage; connectedness; minimum
  - Complete linkage; diameter; maximum
  - Average, median, etc.
- Also, Newman community detection (ncd)

31 Newman community detection (ncd)
- Built for 1/0 data specifically, i.e., networks
- A variation on hierarchical clustering
- At each point, join the pair of clusters (initially single nodes) that would result in the biggest Q modularity score
- Q measures the proportion of ties that are within group, minus the proportion expected by chance (in Newman's formulation, given the degrees of the nodes in each group)
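For reference, the usual form of Newman's Q for an undirected network sums, over groups, the share of ties falling inside the group minus the squared share of tie endpoints attached to the group. The sketch below computes Q for a given partition only; it does not perform the greedy merging. The function name and example network are assumptions.

```python
def modularity(edges, group):
    """Newman's Q for an undirected edge list and a node-to-group mapping."""
    m = len(edges)
    within = {}      # ties with both endpoints in the group
    degree_sum = {}  # total degree of the group's members
    for u, v in edges:
        degree_sum[group[u]] = degree_sum.get(group[u], 0) + 1
        degree_sum[group[v]] = degree_sum.get(group[v], 0) + 1
        if group[u] == group[v]:
            within[group[u]] = within.get(group[u], 0) + 1
    return sum(within.get(g, 0) / m - (degree_sum[g] / (2 * m)) ** 2
               for g in degree_sum)


# Hypothetical network: two triangles joined by a single bridging tie (3-4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
lumped = {n: "A" for n in range(1, 7)}

q_split = modularity(edges, split)    # 2 * (3/7 - (7/14)**2) = 5/14
q_lumped = modularity(edges, lumped)  # 7/7 - (14/14)**2 = 0
```

Putting every node in one group always gives Q = 0, so a positive Q for the two-triangle split indicates more within-group ties than chance would predict.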

32 ncd
[Dendrogram: Newman community detection output, one column per node ID and one row per level, with the Q value reported; values not recoverable]

33 Factions
- Partitions the data into a predetermined number of groups
- Has a fit function which measures how well this is done, and applies combinatorial optimization to improve the fit
- Computationally difficult
- User has to define the number of groups
- No overlap

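34 Factions
- Combinatorial optimization (tabu search) to find the partition that minimizes the Hamming distance between the observed adjacency matrix and an ideal matrix (all ties present within groups, none between)
  - i.e., counts errors
[Figure: observed matrix, ideal matrix, and error counts for the best-fitting partition]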
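The fit function can be sketched as an error count: a dyad is an error if it is tied but split across groups, or untied but placed in the same group. The sketch below shows only this count for a given partition, not the tabu search itself. It counts each undirected dyad once, which is an assumption; a cell-by-cell count on a symmetric matrix would double these numbers. Names and data are illustrative.

```python
from itertools import combinations


def faction_errors(nodes, edges, group):
    """Hamming distance between observed ties and the ideal block structure."""
    ties = {frozenset(e) for e in edges}
    errors = 0
    for u, v in combinations(nodes, 2):
        same_group = group[u] == group[v]
        tied = frozenset((u, v)) in ties
        # ideal matrix: tied if and only if in the same group
        if same_group != tied:
            errors += 1
    return errors


# Hypothetical network: two triangles joined by a single bridging tie (3-4).
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
good = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
poor = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1}

good_errors = faction_errors(nodes, edges, good)
poor_errors = faction_errors(nodes, edges, poor)
```

An optimizer such as tabu search would move nodes between groups and keep the moves that reduce this count; here the triangle-by-triangle partition leaves only the bridging tie as an error.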

35 Factions, 3 & 4 cluster solutions
- 3 clusters: 8 errors, Q = [value not recoverable]
- 4 clusters: 9 errors, Q = [value not recoverable]
- In this case, Q seems less intuitive

36 Campnet Example
Group assignments:
1: HOLLY MICHAEL BILL DON HARRY
2: CAROL PAM PAT JENNIE PAULINE ANN
3: BRAZEY LEE JOHN GERY STEVE BERT RUSS
[Figures: actor-by-actor matrix with rows and columns sorted by group, and a network diagram of the 18 actors; contents not recoverable]

37 Subgroups
- Networks
  - Outcome: clique, relaxed by
    - Distance: n-clique, n-clan, n-club
    - Density: k-core, k-plex, ls-set, lambda set, component
  - Process: Newman-Girvan, Negopy
- Multivariate stats
  - Process: Hiclus, Kmeans, Newman
  - Outcome: factions, comb opt

38 Newman-Girvan
- Calculate edge betweenness for the graph
- Remove the edge with the highest edge betweenness
- If the number of components increases, record the partition
- Recalculate edge betweenness and repeat until all nodes are isolates or the maximum number of clusters is reached/exceeded
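The steps above can be sketched in plain Python as follows. Edge betweenness is computed with a Brandes-style accumulation; because every source node is used, each undirected pair is counted twice, which does not change which edge ranks highest. The function names, the tie-breaking behaviour (first maximum found), and the example graph are assumptions for illustration, and self-loops are assumed absent.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of each undirected edge (each pair counted twice)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {n: 0 for n in adj}
        sigma[s] = 1
        preds = {n: [] for n in adj}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):
            for p in preds[w]:
                share = sigma[p] / sigma[w] * (1 + delta[w])
                bet[frozenset((p, w))] += share
                delta[p] += share
    return bet


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp, stack = {s}, [s]
            while stack:
                u = stack.pop()
                for v in adj[u]:
                    if v not in comp:
                        comp.add(v)
                        stack.append(v)
            seen |= comp
            comps.append(frozenset(comp))
    return comps


def newman_girvan(adj, max_clusters=None):
    """Return the partitions recorded each time the component count increases."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    n_comp = len(components(adj))
    partitions = []
    while any(adj.values()):
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)              # edge with highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > n_comp:
            n_comp = len(comps)
            partitions.append(comps)
            if max_clusters is not None and n_comp >= max_clusters:
                break
    return partitions


# Hypothetical network: two triangles joined by a single bridging tie (3-4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
partitions = newman_girvan(adj)
```

In this example the bridging tie carries every shortest path between the two triangles, so it is removed first and the first recorded partition separates the triangles.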

39 Subgroups
- Graph
  - Outcome: clique, relaxed by
    - Distance: n-clique, n-clan, n-club
    - Density: k-core, k-plex, ls-set, lambda set, component
  - Process: Newman-Girvan, Negopy
- Clustering
  - Process: Hiclus, Kmeans, Newman
  - Outcome: factions, comb opt

40 Clique
- A maximal complete subgraph
- Properties
  - Everyone is adjacent to everyone else
  - Avg distance & diameter is 1
  - Density is 1
- Limitations
  - Undirected
- Issues
  - Multiple, overlapping
  - Too strict
10 cliques found:
1: HOLLY MICHAEL DON HARRY
2: BRAZEY LEE STEVE BERT
3: CAROL PAT PAULINE
4: CAROL PAM PAULINE
5: PAM JENNIE ANN
6: PAM PAULINE ANN
7: MICHAEL BILL DON HARRY
8: JOHN GERY RUSS
9: GERY STEVE RUSS
10: STEVE BERT RUSS
[Figure: network diagram of the 18 actors]
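One standard way to enumerate maximal complete subgraphs is the Bron-Kerbosch recursion, sketched here in its basic form without pivoting. This is not necessarily the routine that produced the ten cliques listed above; the function name and the four-node graph are assumptions for illustration.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates that extend it; x: already-explored nodes
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical graph: a triangle {1,2,3} with a pendant tie 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cliques = maximal_cliques(adj)
big_cliques = [c for c in cliques if len(c) >= 3]  # keep cliques of size 3 or more
```

The pendant tie shows up as a maximal clique of size 2, which is why analyses usually impose a minimum size such as 3.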

41 Overlap
- A graph with 21 vertices can have as many as 2187 cliques
- Secondary analysis looks at the overlap matrices, e.g., form a node-by-node co-membership matrix O such that o_ij gives the number of cliques that i and j are both in
- Run hierarchical clustering on O
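A small sketch of the co-membership count, using the ten cliques listed on the Clique slide as input. The function name and the dictionary-of-pairs output format are assumptions; a full node-by-node matrix could be built from the same counts.

```python
from itertools import combinations

# The ten cliques reported on the Clique slide.
cliques = [
    ["HOLLY", "MICHAEL", "DON", "HARRY"],
    ["BRAZEY", "LEE", "STEVE", "BERT"],
    ["CAROL", "PAT", "PAULINE"],
    ["CAROL", "PAM", "PAULINE"],
    ["PAM", "JENNIE", "ANN"],
    ["PAM", "PAULINE", "ANN"],
    ["MICHAEL", "BILL", "DON", "HARRY"],
    ["JOHN", "GERY", "RUSS"],
    ["GERY", "STEVE", "RUSS"],
    ["STEVE", "BERT", "RUSS"],
]


def comembership(cliques):
    """o_ij = number of cliques containing both i and j, keyed by sorted name pair."""
    counts = {}
    for clique in cliques:
        for i, j in combinations(sorted(clique), 2):
            counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts


overlap = comembership(cliques)
```

For example, STEVE and RUSS share cliques 9 and 10, so their entry is 2, while HOLLY and BILL never appear in the same clique. Clustering O treats these counts as similarities, so larger values mean closer pairs.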

42 Types of Relaxations
- Distance (length of paths)
  - N-clique, n-clan, n-club
- Density (number of ties)
  - K-plex, ls-set, lambda set, k-core, component

43 N-cliques
- Definition
  - Maximal subset S s.t. for all u,v in S, d(u,v) ≤ n
  - Distance among members no greater than a specified maximum
  - When n = 1, we have a clique
- Properties
  - Relaxes notion of clique
  - Avg distance can be greater than 1
- Is {a,b,c,f,e} a 2-clique?

44 Issues with N-Cliques
- Overlapping
  - {a,b,c,f,e} and {b,c,d,f,e} are both 2-cliques
- Membership criterion satisfiable through non-members
- Even 2-cliques can be fairly non-cohesive
  - Red nodes belong to the same 2-clique but none are adjacent
[Figures: example graph on nodes a-f; network with red nodes marking a 2-clique]

45 Subgraphs
- Set of nodes
  - Is just a set of nodes
- A subgraph
  - Is a set of nodes together with ties among them
- An induced subgraph
  - Subgraph defined by a set of nodes
  - Like pulling the nodes and ties out of the original graph
[Figure: example graph on nodes a-f and the subgraph induced by {a,b,c,f,e}]

46 N-Clan
- Definition
  - An n-clique S whose diameter in the subgraph induced by S is ≤ n
  - Members of the set are within n links of each other without using outsiders
- Properties
  - More cohesive than n-cliques
- Is {a,b,c,f,e} a 2-clan?

47 N-Clan Issues
- n-clique membership a bother
- Few found in data
- Overlapping
- Is {a,b,c,f} a 2-clan? List all 2-clans

48 N-Club
- Definition
  - A maximal subset S whose diameter in the subgraph induced by S is ≤ n
  - No n-clique requirement
- Properties
  - Painful to compute
  - More plentiful than n-clans
  - Overlapping
- Is {a,b,c,f} a 2-club?
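The three distance-based definitions differ only in which paths count and whether n-clique membership is required, which a short sketch can make concrete. The six-node graph below is a hypothetical example with numeric labels, not the a-f graph drawn on the slides, and the function names are assumptions. The n-club check tests only the induced-diameter condition; maximality is not verified.

```python
from collections import deque
from itertools import combinations

INF = float("inf")


def distances(adj, allowed=None):
    """All-pairs geodesic distances; if allowed is given, paths may use only those nodes."""
    nodes = set(adj) if allowed is None else set(allowed)
    dist = {}
    for s in nodes:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def all_within(dist, S, n):
    return all(dist[u].get(v, INF) <= n for u, v in combinations(S, 2))


def is_n_clique(adj, S, n):
    """Pairwise distance <= n in the whole graph, and no outsider can be added."""
    full = distances(adj)
    if not all_within(full, S, n):
        return False
    outsiders = set(adj) - set(S)
    return not any(all(full[w].get(u, INF) <= n for u in S) for w in outsiders)


def induced_diameter_ok(adj, S, n):
    """Pairwise distance <= n using only paths inside S (the n-club condition)."""
    return all_within(distances(adj, allowed=S), S, n)


def is_n_clan(adj, S, n):
    return is_n_clique(adj, S, n) and induced_diameter_ok(adj, S, n)


# Hypothetical graph: ties 1-2, 1-3, 2-3, 2-4, 3-5, 4-6, 5-6.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {2, 6}, 5: {3, 6}, 6: {4, 5}}

a_is_clique = is_n_clique(adj, {1, 2, 3, 4, 5}, 2)   # 4 and 5 are linked via outsider 6
a_is_clan = is_n_clan(adj, {1, 2, 3, 4, 5}, 2)
b_is_clan = is_n_clan(adj, {2, 3, 4, 5, 6}, 2)
c_diameter_ok = induced_diameter_ok(adj, {1, 2, 3, 4}, 2)
c_is_clique = is_n_clique(adj, {1, 2, 3, 4}, 2)
```

In this example {1,2,3,4,5} is a 2-clique only because nodes 4 and 5 reach each other through the non-member 6, so it fails the 2-clan test, while {1,2,3,4} has induced diameter 2 without being a 2-clique.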

49 Subgroups
- Graph
  - Outcome: clique, relaxed by
    - Distance: n-clique, n-clan, n-club
    - Density: k-core, k-plex, ls-set, lambda set, component
  - Process: Newman-Girvan, Negopy
- Clustering
  - Process: Hiclus, Kmeans
  - Outcome: factions, comb opt


50 K-plex
- Maximal groups such that each member is connected to all but k others in the group
- In the example figure, {7,8,9,10} is a 2-plex: each member is connected to at least 4 minus 2 other members
- {1,2,3,4,5} is a 3-plex: each member is connected to at least 5 minus 3 others

51 K-Plex
- Is {a,b,d,e} a 2-plex?
- Is {a,b,c,d,e} a 2-plex? A 3-plex?
[Figure: example graph on nodes a-e]

52 K-Plexes: technical definition
- Definition
  - A k-plex is a [maximal] subset S s.t. for all u in S, α(u,S) ≥ |S| - k, where |S| is the size of set S
- Properties
  - Subsets of k-plexes are k-plexes
  - Limited diameter (i.e., get distance as a freebie)
    - If k < (|S|+2)/2 then diameter ≤ 2
  - Very numerous & overlapping
  - Better match to intuition
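The degree condition in the definition is easy to check directly. The sketch below tests only the condition α(u,S) ≥ |S| - k for a given set and does not verify maximality; the function name and the four-node cycle are assumptions for illustration.

```python
def satisfies_k_plex(adj, S, k):
    """True if every member of S is tied to at least |S| - k members of S."""
    S = set(S)
    return all(len(adj[u] & S) >= len(S) - k for u in S)


# Hypothetical graph: a 4-cycle a - b - c - d - a (no diagonals).
adj = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

cycle_is_2_plex = satisfies_k_plex(adj, {"a", "b", "c", "d"}, 2)
cycle_is_1_plex = satisfies_k_plex(adj, {"a", "b", "c", "d"}, 1)
```

Each node on the cycle has 2 ties inside the set, which meets 4 - 2 but not 4 - 1, so the cycle passes as a 2-plex and fails as a 1-plex (a clique).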

53 LS-Sets
- Definition
  - Let H be a subset of V, and let K be any proper subset of H
  - H is LS if α(K,H-K) > α(K,V-H)
    - Subsets are more connected to other members of the LS set than to outsiders
  - or, H is LS if α(K) > α(H)
    - Subsets are better off joining the LS set
    - This one's usually easier to compute

54 LS-Sets
- H is LS if α(K,H-K) > α(K,V-H)
  - Use when K is large
- or, H is LS if α(K) > α(H)
  - Use when K is small
[Figure: example network]

55 LS-Sets
- Properties
  - Very cohesive
  - Wholly nested or disjoint: no partial overlaps
  - More ties within than between (doesn't just consider inside density)
  - Contain no minimum weight cutsets (lie on either side of "fault lines")
  - Multiple edge-independent paths within
  - High edge-connectivity
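For small sets, the first form of the LS condition can be checked by brute force over every nonempty proper subset K of H. The sketch below does exactly that, so its cost grows exponentially with the size of H; the function names and the example network are assumptions for illustration.

```python
from itertools import combinations


def ties_between(adj, A, B):
    """alpha(A, B): number of ties with one end in A and the other in B (A, B disjoint)."""
    return sum(1 for u in A for v in adj[u] if v in B)


def is_ls_set(adj, H):
    """Check alpha(K, H-K) > alpha(K, V-H) for every nonempty proper subset K of H."""
    H = set(H)
    outside = set(adj) - H
    for size in range(1, len(H)):
        for K in combinations(H, size):
            K = set(K)
            if ties_between(adj, K, H - K) <= ties_between(adj, K, outside):
                return False
    return True


# Hypothetical network: two triangles joined by a single bridging tie (3-4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}

triangle_is_ls = is_ls_set(adj, {1, 2, 3})
bridge_is_ls = is_ls_set(adj, {3, 4})
```

Each triangle passes because every subset has more ties to the rest of its triangle than to the outside, while the bridging pair {3,4} fails because node 3 has two ties outside the pair and only one inside.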

56 Lambda Operator
- Let λ(u,v) be the number of edge-independent paths from node u to node v
- λ(u,v) is also the minimum number of ties that must be removed from the network in order to disconnect u and v

57 Lambda Sets
- Definition
  - A set of nodes S is a lambda set if for all a,b,c in S and d not in S, λ(a,b) > λ(c,d)
  - More independent paths to other group members than to outsiders
- Properties
  - Robust: very difficult to disconnect even with intelligent attack
  - Mutually exclusive or wholly inclusive
    - No partially overlapping groups
  - Pure: like n-clubs, defined on a single attribute
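Because λ(u,v) equals the minimum number of ties whose removal disconnects u and v, it can be computed as a maximum flow with unit capacities. The sketch below uses breadth-first augmenting paths for λ and then checks the lambda-set condition by comparing the smallest within-set λ to the largest λ between a member and an outsider. Function names and the example network are assumptions; sets are assumed to have at least two members.

```python
from collections import deque
from itertools import combinations


def lam(adj, s, t):
    """Number of edge-independent paths between s and t (unit-capacity max flow)."""
    # each undirected tie becomes two arcs of capacity 1
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow          # no augmenting path left
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1     # use the arc
            cap[(v, u)] += 1     # allow the flow to be undone later
            v = u
        flow += 1


def is_lambda_set(adj, S):
    """True if every within-set lambda exceeds every member-to-outsider lambda."""
    S = set(S)
    outside = set(adj) - S
    inside_min = min(lam(adj, a, b) for a, b in combinations(S, 2))
    outside_max = max((lam(adj, c, d) for c in S for d in outside), default=0)
    return inside_min > outside_max


# Hypothetical network: two triangles joined by a single bridging tie (3-4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}

lam_inside = lam(adj, 1, 2)    # direct tie plus the path through 3
lam_across = lam(adj, 1, 6)    # everything must cross the bridge
triangle_is_lambda = is_lambda_set(adj, {1, 2, 3})
pair_is_lambda = is_lambda_set(adj, {1, 2})
```

The triangle qualifies because its members are joined by two independent paths but reach outsiders only through the single bridge; the pair {1,2} does not, since node 3 is just as well connected to them as they are to each other.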

58 Lambda Sets
- Non-trivial LS-sets: {1,2,3,4}, {1,2,3,4,5,6,7,8}, {9,10,11,12}
- Non-trivial lambda sets: {1,2,3,4}, {1,2,3,4,5,6,7,8}, {9,10,11,12}, {5,6,7,8}
[Figure: example network on 12 nodes]

59 Subgroups
- Graph
  - Outcome: clique, relaxed by
    - Distance: n-clique, n-clan, n-club
    - Density: k-core, k-plex, ls-set, lambda set, component
  - Process: Newman-Girvan, Negopy
- Clustering
  - Process: Hiclus, Kmeans, Newman
  - Outcome: factions, comb opt

60 Newman-Girvan
- Successively delete the tie with the highest edge betweenness, identifying components after each deletion
- Yields a hierarchical clustering

61 Cohesion family of concepts
[Figure: actor-by-actor matrix for the 18 actors (HOLLY, BILL, DON, HARRY, MICHAEL, PAM, JENNIE, ANN, PAULINE, PAT, CAROL, LEE, JOHN, BRAZEY, GERY, STEVE, BERT, RUSS) with row and column sums; cell values not recoverable. Labels: Concept: Relational, Centrality, Subgroups, Network (average)]
Copyright 2006 Steve Borgatti. All rights reserved.
