Aggregation of Social Networks by Divisive Clustering Method


Aggregation of Social Networks by Divisive Clustering Method

Amine Louati and Yves Lechevallier, INRIA Paris-Rocquencourt, Rocquencourt, France
{alzennyr.da_silva, Yves.Lechevallier, Fabrice.Rossi}@inria.fr

Marie-Aude Aufaure, Centrale Paris, France
Marie-Aude.Aufaure@ecp.fr

HCSD Beijing, October 2011

Outline: Introduction / Motivations - Objectives - K-SNAP algorithm - Our approach - Conclusion

Introduction / Motivations

The data manipulated in an enterprise context are structured data (databases) but also unstructured data such as e-mails, documents, etc. A graph model is a natural way of representing and modeling structured and unstructured data in a unified manner. The main advantage of the graph model resides in its dynamic aspect and its capability to represent relations between individuals. However, the extracted graph has a huge size, which makes these data difficult to analyze and visualize; an aggregation step is therefore needed to obtain more understandable graphs, allowing the user to discover underlying information and hidden relationships between clusters of individuals.

Objectives

Create a data model associated with social networks. Propose an aggregative approach which reduces this information.

Descriptions of the individuals (nodes)

In social networks we have a set of individuals described by a vector of variables (numerical, categorical or symbolic) and a set of relationships:
- Set of individuals: V = {v_1, ..., v_n}
- Set of relations: R = {R_1, ..., R_p}, each relation R_t defined on V × V with u R_t v ∈ {0, 1}
- Set of edges: E = {E_1, ..., E_p}, where (u, v) ∈ E_t for u, v ∈ V
- Set of variables: f_1, ..., f_q, each f_j defined on V

Categorical variable / Relation

Zachary's karate club dataset (UCI datasets). The relation "color of individuals" is a categorical variable because the relation is transitive. But the relation "call with" is not transitive, so "call with" is not a categorical variable.

Node vector space model

[Figure: example graph with nodes v_1, ..., v_5 and edges e_1, ..., e_6]

Build the edge-by-node matrix R, where R[e, v] = 1 if node v is incident with the edge e and 0 otherwise. The node vector r_v of a node v is the corresponding column of R (r_v = R b_v, with b_v the indicator vector of v).
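As a minimal sketch (in Python, not from the slides), the edge-by-node matrix can be built directly from an edge list; the edge list below is hypothetical, since the slide's exact example graph is not recoverable from the transcription:

```python
def incidence_matrix(nodes, edges):
    """Edge-by-node matrix R: R[e][v] = 1 iff node v is incident with edge e."""
    col = {v: i for i, v in enumerate(nodes)}
    R = [[0] * len(nodes) for _ in edges]
    for row, (u, v) in enumerate(edges):
        R[row][col[u]] = 1
        R[row][col[v]] = 1
    return R

nodes = ["v1", "v2", "v3", "v4", "v5"]
edges = [("v1", "v2"), ("v1", "v4"), ("v2", "v3"), ("v2", "v5")]  # hypothetical
R = incidence_matrix(nodes, edges)
# The node vector r_v of the next slides is the column of R for node v:
r_v2 = [R[e][1] for e in range(len(edges))]
```

Each column then plays the role of the node vector r_v used by the k-means approach below.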

Node data table model

[Figure: the same example graph with nodes v_1, ..., v_5]

Description data table: each individual v_i of V = {v_1, v_2, v_3, v_4, v_5} is characterized by a vector (a_i^1, ..., a_i^q) of the q variables.

K-means approach

The node vector r_v represents a node v with respect to the edges in the given graph G = (V, E). The mean vector, or centroid, of the node vectors contained in the cluster C_k is

g_k = (1/|C_k|) Σ_{v ∈ C_k} r_v

The objective function minimized is

E = Σ_{k=1}^{K} Σ_{v ∈ C_k} ||r_v − g_k||²

Problems:
- The dimension of the representation space is high.
- How to add the data table describing the nodes? By a weight between Q_E and Q_A (the objective function on the description data table)? This approach is not realistic.
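The centroid and objective function above can be sketched in a few lines of Python (an illustration, not the authors' code; the cluster assignment is hypothetical):

```python
def centroid(vectors):
    """Mean vector g_k of the node vectors in one cluster."""
    n = len(vectors)
    return [sum(x) / n for x in zip(*vectors)]

def objective(clusters):
    """E = sum over clusters of squared distances to the cluster centroid."""
    E = 0.0
    for vectors in clusters:
        g = centroid(vectors)
        for r in vectors:
            E += sum((ri - gi) ** 2 for ri, gi in zip(r, g))
    return E

clusters = [[[1, 0], [1, 1]], [[0, 1]]]  # hypothetical node vectors per cluster
E = objective(clusters)
```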

Dissimilarity approach

The dissimilarity between two nodes is determined by the number of edges between them and by a description vector of these nodes. Let

N_{R_t}(v) = {w ∈ V | (v, w) ∈ E_t} ∪ {v}

be the neighborhood set of the node v for the relationship R_t. For each pair (n, m) of nodes of a given relationship R_t we compute the contingency table

a = |N_{R_t}(m) ∩ N_{R_t}(n)|,  b = |N_{R_t}(m)| − a,  c = |N_{R_t}(n)| − a,  d = |V| − (a + b + c)

Distances or dissimilarities are defined from the a, b, c, d parameters.

Dissimilarity approach

The most popular measures are the Euclidean distance and the Jaccard index, which are defined by:

Euclidean distance: d_1(n, m) = (b + c)/(a + b + c + d)
Jaccard index: d_2(n, m) = a/(a + b + c)

Remark: with the node vector representation, the Jaccard index is defined by

d_2(n, m) = r_n^T r_m / (r_n^T r_n + r_m^T r_m − r_n^T r_m)
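A small Python sketch of the contingency table and the two measures (the neighborhood sets in the usage example are hypothetical):

```python
def contingency(Nn, Nm, V):
    """a, b, c, d parameters from two neighborhood sets Nn, Nm (subsets of V)."""
    a = len(Nn & Nm)          # neighbors shared by n and m
    b = len(Nm) - a           # neighbors of m only
    c = len(Nn) - a           # neighbors of n only
    d = len(V) - (a + b + c)  # nodes in neither neighborhood
    return a, b, c, d

def euclidean_distance(a, b, c, d):
    return (b + c) / (a + b + c + d)

def jaccard_index(a, b, c, d):
    return a / (a + b + c)

V = set(range(6))                         # hypothetical node set
a, b, c, d = contingency({0, 1, 2}, {1, 2, 3}, V)
```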

Dissimilarity clustering approach

On the set of variables we compute dissimilarities between two nodes adapted to the different types of variables (numerical, categorical, symbolic, functional). We thus have a set of data tables, so we propose to use a multiple dissimilarity tables clustering approach to solve this problem.

F. A. T. De Carvalho, Y. Lechevallier and Filipe M. de Melo (2012). Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recognition.

K-SNAP algorithm

K-SNAP is an algorithm for graph aggregation based on the descriptions of nodes and edges. It allows the user to intervene in the aggregation procedure.

Algorithm:
Step 1 (setting): the user selects variables (description of the nodes) and relations (description of the edges), and fixes the size of the aggregated graph (number of clusters).
Step 2 (graph aggregation): the procedure consists of two completely independent steps:
- aggregation based on the variable set: A-groupement
- aggregation based on the relation set: (A, R)-groupement

Groupement concepts

A-groupement: all nodes belonging to a cluster must have the same values on all variables.
(A, R)-groupement: all nodes belonging to a cluster must have the same list of neighbor clusters.

Y. Tian, R. A. Hankins and J. M. Patel (2008). Efficient aggregation for graph summarization. In SIGMOD '08.
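The A-groupement condition (identical values on all variables) amounts to grouping nodes by their attribute tuples; a minimal sketch, with a hypothetical description table:

```python
from collections import defaultdict

def a_groupement(description):
    """Group nodes whose attribute vectors are identical on all variables."""
    groups = defaultdict(list)
    for node, values in description.items():
        groups[tuple(values)].append(node)
    return sorted(groups.values())

# Hypothetical description: one categorical variable with modalities A / B
description = {"v1": ("A",), "v2": ("B",), "v3": ("A",)}
clusters = a_groupement(description)
```

Each resulting cluster is one cell of the Cartesian product of modalities, which is exactly the rigidity criticized later in the limitations slide.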

A-groupement step

3 modalities: A, B, C.

[Figure: the nodes labeled A, B and C are grouped into one cluster per modality]

(A-R) groupement: selection step

The edge set is added. Select the cluster that must be split; we select the cluster by using the objective function.

[Figure: the aggregated graph with the clusters A, B and C]

(A-R) groupement: splitting step

Divide the set into subsets whose nodes have the same neighbor clusters.

[Figure: the selected cluster divided according to its neighbor clusters]

(A-R) groupement: splitting step

[Figure: the resulting aggregated graph after the split]

Limitations of K-SNAP

- Only applies to a homogeneous graph: nodes must have the same description.
- Aggregation is very rigid in terms of categorical variables: Cartesian product of all modalities.
- Neighbor clusters: the subsets created must have the same neighbor clusters.
- Ineffective in the presence of a large number of categorical variables and heterogeneous relationships: it increases the number of clusters with small size.

Our approach

- Integration of the "Dynamic clustering" method in the A-groupement step:
  - use classical Dynamic clustering or K-means when there is no a priori knowledge on the nodes;
  - use Symbolic Dynamic clustering on the set of modalities created by the A-groupement step (reduces the number of clusters).
- Proposal of two new aggregation evaluation criteria to improve the quality of the results while adopting the principle of K-SNAP in the (A-R)-groupement step:
  - use the degree of a node and a centrality criterion.

Local degree of a node

The local degree of the node v associated with the relationship R_t and the class C_i is

Deg_{t,i}(v) = |N_{R_t}(v) ∩ C_i|,  where N_{R_t}(v) = {w ∈ V | (v, w) ∈ E_t} ∪ {v}

The complementary local degree of the node v associated with the relationship R_t and the class C_i is

Deḡ_{t,i}(v) = |N_{R_t}(v) \ C_i|

It counts the rest of the links issued from v.

Measure of homogeneity

For a given partition P = (C_1, C_2, ..., C_k), this measure Δ evaluates the homogeneity of the partition P and determines the cluster to be divided. For each relation R_t and each cluster C_i we denote:

Intra-group criterion: I_t(C_i) = Σ_{v ∈ C_i} Deg_{t,i}(v)
Inter-group criterion: IE_t(C_i) = Σ_{v ∈ C_i} Deḡ_{t,i}(v)

Δ = Σ_t Σ_{i=1}^{k} δ_{t,i},  with δ_{t,i} = I_t(C_i) / IE_t(C_i)

where Deg_t(v) = |N_{R_t}(v)| is the degree of the vertex v according to the relationship R_t.
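A minimal Python sketch of the intra/inter criteria and δ for one relationship (not the authors' code; for simplicity the example neighbor sets exclude the node itself, and a cluster is assumed to have at least one outgoing link):

```python
def local_degree(v, neighbors, cluster):
    """Deg_{t,i}(v) = |N_{R_t}(v) ∩ C_i|, with neighbors[v] the set N_{R_t}(v)."""
    return len(neighbors[v] & cluster)

def delta(cluster, neighbors):
    """δ_{t,i} = I_t(C_i) / IE_t(C_i): intra-group links over inter-group links."""
    I = sum(local_degree(v, neighbors, cluster) for v in cluster)       # intra
    IE = sum(len(neighbors[v] - cluster) for v in cluster)              # inter
    return I / IE  # assumes IE > 0, i.e. the cluster has outgoing links

# Hypothetical graph: a-b inside the candidate cluster, a-x leaving it
neighbors = {"a": {"b", "x"}, "b": {"a"}, "x": {"a"}}
d = delta({"a", "b"}, neighbors)
```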

(A-R) groupement: selection step

The algorithm consists of finding, at each iteration, the relationship R_t and the cluster C_i that minimize the evaluation measure δ_{t,i}, until the cardinality of the partition is equal to K. Choose the cluster i* and the relationship t* such that:

(i*, t*) = argmin_{C_i ∈ P, R_t ∈ R} δ_{t,i},  with δ_{t,i} = I_t(C_i) / IE_t(C_i)

(A-R) groupement: splitting step

On the selected cluster C_{i*} we find the central node v_d which maximizes the centrality degree:

d = argmax_{v ∈ C_{i*}} Deg_{t*,i*}(v)

C_{i*} is divided into two subgroups according to the following strategy:
- one contains the central node with its neighbors in C_{i*}: N_{R_{t*}}(v_d) ∩ C_{i*};
- the other contains the rest of the group.
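The splitting strategy above can be sketched as follows (an illustration with a hypothetical graph, not the authors' implementation):

```python
def split_cluster(cluster, neighbors):
    """Split the selected cluster around its most central node."""
    # Central node v_d: maximizes the local degree inside the cluster.
    v_d = max(cluster, key=lambda v: len(neighbors[v] & cluster))
    first = (neighbors[v_d] & cluster) | {v_d}  # v_d plus its neighbors in C
    rest = cluster - first                      # the rest of the group
    return first, rest

# Hypothetical cluster: "a" is central, "d" is isolated from it
neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
first, rest = split_cluster({"a", "b", "c", "d"}, neighbors)
```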

Elaborated by Mark Newman, this data set contains 105 vertices and 441 edges.

[Figure: the network]

Our approach vs. K-SNAP

[Figure: the aggregated graphs produced by our approach and by K-SNAP]

Conclusions

- Development of new evaluation criteria to improve the quality of the results by using the measure of homogeneity.
- For graphs without a priori information, replace the A-groupement by a clustering step.

References

1. Y. Tian, R. A. Hankins and J. M. Patel (2008). Efficient aggregation for graph summarization. In SIGMOD '08.
2. Louati et al. (2011). Recherche de classes dans les réseaux sociaux [Searching for classes in social networks]. SFC 2011.
3. F. A. T. De Carvalho, Y. Lechevallier and Filipe M. de Melo (2012). Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recognition.
4. R. Soussi et al. Extraction et analyse de réseaux sociaux issus de bases de données relationnelles [Extraction and analysis of social networks from relational databases]. EGC 2011: 371-376.
5. R. Godin, R. Missaoui and H. Alaoui (1995). Incremental concept formation algorithms based on Galois lattices. Computational Intelligence, 11(2), pp. 246-267.

Thank you.