Aggregation of Social Networks by Divisive Clustering Method
Amine Louati and Yves Lechevallier, INRIA Paris-Rocquencourt, Rocquencourt, France, {Alzennyr.da_silva, Yves.Lechevallier, Fabrice.Rossi}@inria.fr
Marie-Aude Aufaure, Centrale Paris, France, Marie-Aude.Aufaure@ecp.fr
HCSD Beijing, October 2011

Outline
- Introduction / Motivations
- Objectives
- k-SNAP algorithm
- Our approach
- Conclusion

Introduction / Motivations
The data manipulated in an enterprise context are structured data (databases) but also unstructured data such as e-mails, documents, etc.
A graph model is a natural way of representing and modeling structured and unstructured data in a unified manner.
The main advantage of the graph model resides in its dynamic aspect and its capability to represent relations between individuals.
However, the extracted graph is huge, which makes these data difficult to analyze and visualize. An aggregation step is therefore needed to obtain more understandable graphs, so that the user can discover underlying information and hidden relationships between clusters of individuals.

Objectives
- Create a data model associated with a social network.
- Propose an aggregative approach which reduces this information.

Descriptions of the individuals (nodes)
In social networks we have a set of individuals described by a vector of variables (numerical, categorical or symbolic) and a set of relationships.
- Set of individuals: $V = \{v_1, \dots, v_n\}$
- Set of relations: $R = \{R_1, \dots, R_p\}$, each defined on $V \times V$
- Set of edges: $E = \{E_1, \dots, E_p\}$, where for $u, v \in V$, $(u, v) \in E_t$ if $u \, R_t \, v$
- Set of variables: $A = \{A_1, \dots, A_q\}$, each defined on $V$

Categorical variable / Relation
Zachary's karate club dataset (UCI datasets)
The relation "color of individuals" is a categorical variable because the relation is transitive.
The relation "call with", however, is not transitive, so "call with" is not a categorical variable.

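Node vector space model
[Figure: example graph with nodes $v_1, \dots, v_5$ and edges $e_1, \dots, e_6$, shown with its $6 \times 5$ edge-by-node matrix]
Build the edge-by-node matrix $R = (r_{ij})$, with rows $e_1, \dots, e_6$ and columns $v_1, \dots, v_5$:
$$r_{ij} = \begin{cases} 1 & \text{if node } v_j \text{ is incident with the edge } e_i \\ 0 & \text{otherwise} \end{cases}$$
The node vector of $v_j$ is the $j$-th column of $R$, for example $r_2 = R\, b_2$ with $b_2$ the second canonical basis vector.
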
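A minimal sketch of the edge-by-node matrix, assuming a small hypothetical edge list (the entries of the matrix on the slide are not recoverable, so this is not the graph of the figure). The function name `edge_by_node_matrix` and the 0-based node labels are illustrative choices.

```python
import numpy as np

def edge_by_node_matrix(n_nodes, edges):
    """Return the |E| x |V| 0/1 matrix R with R[i, j] = 1 iff node j is incident with edge i."""
    R = np.zeros((len(edges), n_nodes), dtype=int)
    for i, (u, v) in enumerate(edges):
        R[i, u] = 1
        R[i, v] = 1
    return R

# Hypothetical graph: 5 nodes (labelled 0..4) and 6 undirected edges.
edges = [(0, 1), (0, 3), (1, 2), (1, 4), (2, 4), (3, 4)]
R = edge_by_node_matrix(5, edges)

# Row j of the transpose is the node vector of node j (column j of R).
node_vectors = R.T
```

Each row of `R` sums to 2 (an edge has two endpoints), and the sum of a node vector is the degree of that node.
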
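Node data table model
[Figure: example graph with nodes $v_1, \dots, v_5$ and edges $e_1, \dots, e_6$]
$V = \{v_1, v_2, v_3, v_4, v_5\}$
Each individual $v_i$ of $V$ is characterized by a vector $[a_i^1, \dots, a_i^q]$.
Description data table (rows $v_1, \dots, v_5$, one column per variable):
$$\begin{bmatrix} a_1^1 & \cdots & a_1^q \\ \vdots & & \vdots \\ a_5^1 & \cdots & a_5^q \end{bmatrix}$$
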
K-means approach
The node vector $r_v$ represents a node $v$ with respect to the edges in the given graph $G = (V, E)$.
The mean vector, or centroid, of the node vectors contained in the cluster $C_k$ is
$$g_k = \frac{1}{|C_k|} \sum_{v \in C_k} r_v$$
The objective function minimized is
$$Q_E = \sum_{k=1}^{K} \sum_{v \in C_k} \| r_v - g_k \|^2$$
Problems:
- The dimension of the representation space is high.
- How to add the data table describing the nodes? By a weight between $Q_E$ and $Q_A$ (the objective function on the description data table)?
This approach is not realistic.

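The centroid and the objective function of the K-means approach can be sketched as follows. This is an illustration only: the node vectors in `X` are hypothetical 0/1 rows, the names `cluster_centroids` and `kmeans_objective` are assumptions, and every cluster is assumed to be non-empty.

```python
import numpy as np

def cluster_centroids(X, labels, K):
    """Centroid g_k of each cluster: the mean of the node vectors assigned to it."""
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])

def kmeans_objective(X, labels, K):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    G = cluster_centroids(X, labels, K)
    return float(((X - G[labels]) ** 2).sum())

# Hypothetical node vectors (one row per node, one column per edge).
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
labels = np.array([0, 0, 1, 1])
Q_E = kmeans_objective(X, labels, K=2)
```

The number of columns of `X` equals the number of edges, which illustrates why the representation space becomes large on real networks.
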
Dissimilarity approach
The dissimilarity between two nodes is determined by the number of edges between them and by a description vector of these nodes.
Let $N_{R_t}(v) = \{ w \in V \mid (v, w) \in E_t \} \cup \{v\}$ be the neighborhood set of the node $v$ for the relationship $R_t$.
For each pair $(n, m)$ of nodes and a given relationship $R_t$ we compute the contingency table
$$a = |N_{R_t}(m) \cap N_{R_t}(n)|, \quad b = |N_{R_t}(m)| - a, \quad c = |N_{R_t}(n)| - a, \quad d = |V| - (a + b + c)$$
Distances or dissimilarities are defined from the parameters $a$, $b$, $c$, $d$.

Dissimilarity approach
The most popular measures are the Euclidean distance and the Jaccard index, which are defined by
- Euclidean distance: $d_1(n, m) = (b + c) / (a + b + c + d)$
- Jaccard index: $d_2(n, m) = a / (a + b + c)$
Note that $d_2$ is a similarity (it equals 1 for identical neighborhoods); the associated dissimilarity is $1 - d_2$.
Remark: with the node vector representation the Jaccard index is defined by
$$d_2(n, m) = \frac{r_n^T r_m}{r_n^T r_n + r_m^T r_m - r_n^T r_m}$$

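As an illustration, the contingency counts and the two measures can be computed from neighborhood sets. This is a sketch under assumptions: the graph is a hypothetical adjacency dictionary for a single relationship, and $d$ is taken as the number of nodes in neither neighborhood, $d = |V| - (a + b + c)$.

```python
def neighborhood(adj, v):
    """Neighborhood of v for one relationship, including v itself."""
    return set(adj[v]) | {v}

def contingency(adj, n, m):
    """Return (a, b, c, d) for the pair of nodes (n, m)."""
    Nn, Nm = neighborhood(adj, n), neighborhood(adj, m)
    a = len(Nn & Nm)
    b = len(Nm) - a
    c = len(Nn) - a
    d = len(adj) - (a + b + c)  # assumption: nodes in neither neighborhood
    return a, b, c, d

def euclidean_measure(a, b, c, d):
    return (b + c) / (a + b + c + d)

def jaccard_index(a, b, c, d):
    return a / (a + b + c)

# Hypothetical undirected graph on 5 nodes, given as adjacency sets.
adj = {0: {1, 3}, 1: {0, 2, 4}, 2: {1, 4}, 3: {0, 4}, 4: {1, 2, 3}}
a, b, c, d = contingency(adj, 0, 1)
d1 = euclidean_measure(a, b, c, d)
d2 = jaccard_index(a, b, c, d)
```

Because each neighborhood contains the node itself, $a + b + c$ is never zero, so the Jaccard index is always defined.
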
Dissimilarity clustering approach
On the set of variables we compute dissimilarities between two nodes, adapted to the different types of variables (numerical, categorical, symbolic, functional).
We have a set of data tables, so we propose to use a clustering approach based on multiple dissimilarity tables to solve this problem.
F. A. T. De Carvalho, Y. Lechevallier and Filipe M. de Melo (2012). Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recognition.

k-SNAP algorithm
k-SNAP is an algorithm for graph aggregation that:
- is based on the descriptions of nodes and edges;
- allows the user to intervene in the aggregation procedure.
Algorithm:
- Step 1, setting: the user selects variables (description of the nodes) and relations (description of the edges), and fixes the size of the aggregated graph (number of clusters).
- Step 2, graph aggregation: the procedure consists of two completely independent steps:
  - aggregation based on the variable set: A-grouping;
  - aggregation based on the relation set: (A,R)-grouping.

Grouping concepts
- A-grouping: all nodes belonging to a cluster must have the same values on all variables.
- (A,R)-grouping: all nodes belonging to a cluster must have the same list of neighbor clusters.
Y. Tian, R. A. Hankins and J. M. Patel (2008). Efficient aggregation for graph summarization. In SIGMOD '08.

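A minimal sketch of the A-grouping condition, assuming node descriptions are stored as dictionaries. The attribute names (`color`, `size`), the node labels and the function `a_grouping` are hypothetical and only illustrate that nodes are grouped by the tuple of their values on the selected variables.

```python
from collections import defaultdict

def a_grouping(attributes, selected):
    """Group nodes that share the same values on all selected variables."""
    groups = defaultdict(list)
    for node, values in attributes.items():
        key = tuple(values[name] for name in selected)
        groups[key].append(node)
    return dict(groups)

# Hypothetical node descriptions.
attributes = {
    "v1": {"color": "A", "size": "small"},
    "v2": {"color": "A", "size": "small"},
    "v3": {"color": "B", "size": "small"},
    "v4": {"color": "A", "size": "large"},
}
by_color = a_grouping(attributes, ["color"])
by_color_and_size = a_grouping(attributes, ["color", "size"])
```

Selecting more variables can only refine the grouping: here one variable gives 2 clusters and two variables give 3.
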
A-grouping step
3 modalities: A, B, C
[Figure: nodes labelled A, B and C, grouped by modality]

(A,R)-grouping: selection step
The edge set is added.
Select the cluster that must be split.
We select the cluster by using the objective function.
[Figure: clusters of nodes labelled A, B and C, with edges]

(A,R)-grouping: splitting step
Divide the set into subsets in which the nodes have the same neighbor clusters.
[Figure: clusters of nodes labelled A, B and C, with edges]

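The splitting rule can be sketched as follows for a single relationship. The graph, the cluster labels and the function names are hypothetical, and the sketch only shows the rule stated on the slide (same set of neighbor clusters), not the full procedure of Tian et al.

```python
from collections import defaultdict

def neighbor_clusters(adj, cluster_of, v):
    """Set of clusters that contain at least one neighbor of v."""
    return frozenset(cluster_of[w] for w in adj[v])

def split_by_neighbor_clusters(adj, cluster_of, cluster):
    """Split a cluster into subsets whose nodes see the same neighbor clusters."""
    parts = defaultdict(list)
    for v in cluster:
        parts[neighbor_clusters(adj, cluster_of, v)].append(v)
    return list(parts.values())

# Hypothetical graph: nodes 0-2 are in cluster "A", 3-4 in "B", 5 in "C".
adj = {0: {3}, 1: {4}, 2: {5}, 3: {0}, 4: {1}, 5: {2}}
cluster_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
subsets = split_by_neighbor_clusters(adj, cluster_of, [0, 1, 2])
```

Nodes 0 and 1 only have neighbors in cluster "B" while node 2 only has neighbors in "C", so cluster "A" is divided into two subsets.
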
(A,R)-grouping: splitting step (continued)
[Figure: clusters of nodes labelled A, B and C after splitting]

Limitations of k-SNAP
- Only applies to a homogeneous graph: all nodes have the same description.
- Aggregation is very rigid in terms of categorical variables: Cartesian product of all modalities.
- Neighbor clusters: the subsets created must have the same neighbor clusters.
- Ineffective in the presence of a large number of categorical variables and heterogeneous relationships: it increases the number of small clusters.

Our approach
- Integration of the "dynamic clustering" method in the A-grouping step:
  - use classical dynamic clustering or K-means when there is no a priori knowledge on the nodes;
  - use symbolic dynamic clustering on the set of modalities created by the A-grouping step (reduces the number of clusters).
- Proposal of two new evaluation criteria for aggregation, to improve the quality of the results while adopting the principle of k-SNAP in the (A,R)-grouping step:
  - use the degree of a node and a centrality criterion.

Local degree of a node
The local degree of the node $v$ associated with the relationship $R_t$ and the class $C_i$ is
$$\mathrm{Deg}_{t,i}(v) = |N_{R_t}(v) \cap C_i|, \quad \text{where } N_{R_t}(v) = \{ w \in V \mid (v, w) \in E_t \} \cup \{v\}$$
The complementary local degree of the node $v$ associated with the relationship $R_t$ and the class $C_i$ is
$$\overline{\mathrm{Deg}}_{t,i}(v) = |N_{R_t}(v) \cap \overline{C_i}|$$
It includes the rest of the links issued from $v$.

Measure of homogeneity
For a given partition $P = (C_1, C_2, \dots, C_k)$, the measure $\Delta$ evaluates the homogeneity of the partition $P$ and determines the cluster to be divided.
For each relation $R_t$ and cluster $C_i$, we denote:
- Intra-group criterion: $IA_t(C_i) = \sum_{v \in C_i} \mathrm{Deg}_{t,i}(v)$
- Inter-group criterion: $IE_t(C_i) = \sum_{v \in C_i} \overline{\mathrm{Deg}}_{t,i}(v)$
$$\Delta_t = \sum_{i=1}^{k} \delta_{t,i} = \sum_{i=1}^{k} \frac{IA_t(C_i)}{IE_t(C_i)}$$
All degrees are computed with respect to the relationship $R_t$.

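The local degrees and the homogeneity measure can be sketched for a single relationship as below. The adjacency dictionary and partition are hypothetical, and returning infinity when a cluster has no outgoing link is an assumption of this sketch; the slides do not say how that case is handled.

```python
import math

def neighborhood(adj, v):
    """Neighborhood of v, including v itself."""
    return set(adj[v]) | {v}

def local_degree(adj, v, cluster):
    """Number of elements of the neighborhood of v that lie inside the cluster."""
    return len(neighborhood(adj, v) & cluster)

def complementary_local_degree(adj, v, cluster):
    """Number of elements of the neighborhood of v that lie outside the cluster."""
    return len(neighborhood(adj, v) - cluster)

def intra(adj, cluster):
    return sum(local_degree(adj, v, cluster) for v in cluster)

def inter(adj, cluster):
    return sum(complementary_local_degree(adj, v, cluster) for v in cluster)

def delta(adj, cluster):
    ie = inter(adj, cluster)
    # Assumption: a cluster with no outgoing link gets an infinite ratio.
    return math.inf if ie == 0 else intra(adj, cluster) / ie

def homogeneity(adj, partition):
    return sum(delta(adj, cluster) for cluster in partition)

# Hypothetical graph: a triangle {0, 1, 2} attached to the path 2 - 3 - 4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
partition = [{0, 1, 2}, {3, 4}]
total = homogeneity(adj, partition)
```

In this example the dense triangle has a much larger ratio (9) than the pair {3, 4} (4), so the pair is the less homogeneous cluster.
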
(A,R)-grouping: selection step
At each iteration, the algorithm finds the relationship $R_t$ and the cluster $C_i$ that minimize the evaluation measure $\delta_{t,i}$, until the cardinality of the partition is equal to $K$.
Choose the cluster $i^*$ and the relationship $t^*$ such that:
$$(t^*, i^*) = \arg\min_{C_i \in P,\; R_t \in R} \delta_{t,i}, \quad \text{with } \delta_{t,i} = IA_t(C_i) / IE_t(C_i)$$

(A,R)-grouping: splitting step
On the selected cluster $C_i$ we find the central node $v_d$ which maximizes the centrality degree:
$$v_d = \arg\max_{v \in C_i} \mathrm{Deg}_{t,i}(v)$$
$C_i$ is divided into two subgroups according to the following strategy:
- one contains the central node with its neighbors in $C_i$, that is $N_{R_t}(v_d) \cap C_i$;
- the other contains the rest of the group.

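One iteration of the selection and splitting steps can be sketched as follows, for a single relationship. This is an illustration under assumptions: the graph is hypothetical, ties are broken by the smallest node label, a cluster with no outgoing link gets an infinite ratio, and a cluster that cannot be split is left unchanged.

```python
import math

def neighborhood(adj, v):
    return set(adj[v]) | {v}

def delta(adj, cluster):
    """Ratio of intra-group to inter-group local degrees of a cluster."""
    ia = sum(len(neighborhood(adj, v) & cluster) for v in cluster)
    ie = sum(len(neighborhood(adj, v) - cluster) for v in cluster)
    return math.inf if ie == 0 else ia / ie  # assumption for ie == 0

def select_cluster(adj, partition):
    """Index of the cluster with the smallest ratio."""
    return min(range(len(partition)), key=lambda i: delta(adj, partition[i]))

def split_cluster(adj, cluster):
    """Central node plus its neighbors inside the cluster, and the rest."""
    center = max(sorted(cluster), key=lambda v: len(neighborhood(adj, v) & cluster))
    first = neighborhood(adj, center) & cluster
    return first, cluster - first

def divisive_step(adj, partition):
    i = select_cluster(adj, partition)
    first, rest = split_cluster(adj, partition[i])
    parts = [first] + ([rest] if rest else [])
    return partition[:i] + parts + partition[i + 1:]

# Hypothetical graph: a 4-clique {0, 1, 2, 3} linked to the path 4 - 5 - 6 - 7.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5}, 5: {4, 6}, 6: {5, 7}, 7: {6}}
partition = [{0, 1, 2, 3}, {4, 5, 6, 7}]
new_partition = divisive_step(adj, partition)
```

Here the path has the smaller ratio (10 against 16), its central node is 5, and the cluster is divided into {4, 5, 6} and {7}. Repeating `divisive_step` until the partition has K clusters gives the divisive procedure described above.
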
Compiled by Mark Newman, this data set contains 105 vertices and 441 edges.

Our approach / k-SNAP
[Figure: results of our approach and of k-SNAP]

Conclusions
- Development of new evaluation criteria to improve the quality of the results by using the measure of homogeneity.
- For graphs without a priori information, the A-grouping is replaced by a clustering step.

References
1. Y. Tian, R. A. Hankins and J. M. Patel (2008). Efficient aggregation for graph summarization. In SIGMOD '08.
2. Louati et al. (2011). Recherche de classes dans les réseaux sociaux. SFC 2011.
3. F. A. T. De Carvalho, Y. Lechevallier and Filipe M. de Melo (2012). Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recognition.
4. R. Soussi et al. Extraction et analyse de réseaux sociaux issus de bases de données relationnelles. EGC 2011: 371-376.
5. R. Godin, R. Missaoui and H. Alaoui (1995). Incremental concept formation algorithms based on Galois lattices. Computational Intelligence, 11(2), pp. 246-267.

Thank you.