ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength of Ties Prof. James She james.she@ust.hk 1
Last lecture 2
Selected works from Tutorial #2 From the "Betweenness vs Degree" scatter plot, it seems that in general, the higher the betweenness, the higher the degree. On the other hand, a high degree doesn't always mean high betweenness: some nodes have over 600 total degree while still having near-zero betweenness (on the 10^5 scale of the plot). Work from Samuel Chan Work from Tommy Lam 3
Understanding Betweenness 4
Understanding Betweenness 5
Announcements 1. Tutorial #3 tomorrow! 2. More technical / practical programming for data analytics! 3. Make sure you know the Matlab and Python basics soon (timeline figure: technical challenges / fun over Wk 3, Wk 4, Wk 5) 6
Summary of this lecture 1. Centrality continued 2. Similarity and Tie Strength of Nodes 3. Introduction to Recommendation 7
Centrality continued 8
Recall: Adjacency Matrix Social graph and its adjacency matrix 9
Recall: Betweenness Centrality Intuition: how many pairs of individuals would have to go through you in order to reach one another in the min. # of hops? Who has higher betweenness, X or Y? 10
Betweenness Centrality Or, $C_B(i) = \sum_{j<k} g_{jk}(i) / g_{jk}$, where $g_{jk}$ = # of geodesics connecting j and k; $g_{jk}(i)$ = # of those that actor i is on. 11
Betweenness Centrality C why do C and D each have betweenness 1? A B E They are both on shortest paths for pairs (A,E), and (B,E), and so must share credit: ½+½ = 1 D 12
Betweenness vs Degree Centrality Data visualization Nodes are sized by degree, and colored by betweenness. Can you spot nodes with high betweenness but relatively low degree? What about high degree but relatively low betweenness? 13
Closeness Centrality What if the node's importance is not simply due to the number of direct friends (degree centrality) or being in between others (betweenness centrality), but to being close to everyone (closeness centrality)?
Closeness Centrality 15
Closeness Centrality $C_C(v_i) = \left[ \frac{\sum_{j \ne i} g(v_i, v_j)}{n-1} \right]^{-1}$ Example (graph on nodes A, B, C, D, E): $C_C = \left[ \frac{1+2+3+4}{4} \right]^{-1} = \left[ \frac{10}{4} \right]^{-1} = 0.4$
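The same value can be checked in Python with networkx; a minimal sketch, assuming the example graph is a simple chain A-B-C-D-E (consistent with distances 1, 2, 3, 4 from the end node).
import networkx as nx
# Assumed chain graph A-B-C-D-E
G = nx.path_graph(["A", "B", "C", "D", "E"])
# closeness_centrality returns (n-1) / sum of shortest-path distances
cc = nx.closeness_centrality(G)
print(cc["A"])   # 4 / (1+2+3+4) = 0.4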
Closeness Centrality
Closeness vs Degree Centrality Data Visualization Degree denoted by size Closeness denoted by color Nodes with high closeness are located in the middle of the graph
Eigenvector Centrality An aggregated metric to characterize the "global" importance of a node, as opposed to "local" importance e.g., PageRank (used in Google's early search engine) Node importance due to the centralities of its neighbors 19
Eigenvector Centrality Modified version: PageRank PageRank: used by Google's search engine to rank web pages in terms of their relative importance (see wikipedia.com) The idea: a page p_i is relatively more important, with a higher PR(p_i), when it is linked to by many other important pages 20
Eigenvector Centrality Consider the graph with a 5x5 adjacency matrix, A Let x be a 5x1 centrality vector of the nodes (in terms of degree) 21
Eigenvector Centrality Multiply the matrix A by vector x: each entry of the result is the sum of the centralities of that node's neighbors 22
Eigenvector Centrality What if the process keeps repeating? x is updated repeatedly, eventually reaching an equilibrium when the in/out is balanced with neighbors. The final x = {x_1, x_2, x_3, x_4, x_5} captures the centrality 23
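A minimal power-iteration sketch of this repeated update in Python with numpy, using the 5-node adjacency matrix from the later slide; the iteration count is an arbitrary choice.
import numpy as np
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
x = np.ones(5)                      # start from a uniform centrality vector
for _ in range(100):
    x = A @ x                       # each entry becomes the sum of neighbours' centralities
    x /= np.linalg.norm(x)          # rescale so the values stay bounded
print(np.round(x, 3))               # converges to ~[0.180, 0.475, 0.537, 0.537, 0.407]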
Eigenvector Centrality Recall linear algebra basics Eigenvectors (for a square m x m matrix A): $Av = \lambda v$, where v is a (right) eigenvector and $\lambda$ is an eigenvalue There are at most m distinct solutions $\lambda_1, \lambda_2, \dots, \lambda_m$ Eigenvectors for distinct eigenvalues are orthogonal: $\lambda_1 \ne \lambda_2 \Rightarrow v_1 \cdot v_2 = 0$ 24
Eigenvector Centrality Find the solution: eigenvalue decomposition $A = U \Lambda U^{-1}$, with $\Lambda$ diagonal Columns of U are eigenvectors of A; diagonal elements of $\Lambda$ are eigenvalues of A The effect of the largest eigenvalue is largest 25
Eigenvector Centrality Eigenvector corresponding to the largest eigenvalue $A = \begin{pmatrix} 0&1&0&0&0 \\ 1&0&1&1&0 \\ 0&1&0&1&1 \\ 0&1&1&0&1 \\ 0&0&1&1&0 \end{pmatrix}$, $Ax = \lambda x$ e.g., Matlab: [vector, value] = eig(A) Leading eigenvector: (0.180, 0.475, 0.537, 0.537, 0.407), so nodes C and D have the highest eigenvector centrality 26
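A rough numpy equivalent of the Matlab call above (variable names are illustrative):
import numpy as np
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
values, vectors = np.linalg.eig(A)          # columns of `vectors` are eigenvectors of A
leading = vectors[:, np.argmax(values)]     # eigenvector of the largest eigenvalue
print(np.round(np.abs(leading), 3))         # ~[0.180, 0.475, 0.537, 0.537, 0.407]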
Importance of Nodes In summary (example graph with nodes 1-10): Degree Centrality: the node with the most direct connections Closeness Centrality: the nodes with the shortest paths to all other nodes Betweenness Centrality: the node that connects the two groups of nodes to all other nodes Eigenvector Centrality: the node whose neighbors are the most important 27
10 min break 29
Strength of Ties and Similarity
Measurement | Description | Applications
Similarity | Indicates the similarity between nodes by their common attributes (e.g., contacts or interests) | Determine the tie strength, community of nodes, recommendations, etc.
Tie Strength | Indicates the strength of the link between nodes (e.g., frequency of interactions and duration of encounters) | Determine if a link is a weak/strong connection, other hidden and missing info. 30
Similarity Jaccard similarity!(# $, # & ) = ) * ), ) * ), Used to quantify the similarity between 2 sets # $ and # &. 0 J(U i,u j ) 1 1: the 2 sets are identical, 0: the 2 sets have no common elements. 31
Recall User Profiles in Social Media 32
Learning from Profiles Tie strength based on attributes Descriptive attributes of nodes (e.g., interests, common friends, etc.) Consider nodes' similarity based on these attributes e.g., Users A and B have more common interests -> strong tie (high similarity); user C shares few with them -> weak tie (low similarity) (figure: user A: Reading, Film, Painting, Swimming; user B: Reading, Film, Singing; user C: TV Game, Hiking) 33
Learning from Interactions Tie strength based on contacts Structural features: # of common friends, Jaccard similarity of the friend sets More common friends -> higher similarity e.g., Users A and B have more common friends -> strong tie (high similarity) (figure: user A friend of D, E, F; user B friend of D, E, F; user C friend of D, G, H -> weak tie (low similarity)) 34
Similarity Jaccard similarity example Favorite interests: User A U_A = {Reading, Film, Painting, Swimming} User B U_B = {Reading, Film, Painting, Singing} User C U_C = {Reading, TV games} $J(U_A, U_B) = \frac{|\{Reading, Film, Painting\}|}{|\{Reading, Film, Painting, Swimming, Singing\}|} = \frac{3}{5}$ (# of interests in common / total # of interests that the two people have) $J(U_A, U_C) = \frac{|\{Reading\}|}{|\{Reading, Film, Painting, Swimming, TV games\}|} = \frac{1}{5}$ 35
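A small Python sketch of the same calculation with built-in sets (the interest lists mirror this slide):
def jaccard(u, v):
    """|intersection| / |union| of two sets of interests."""
    return len(u & v) / len(u | v)
U_A = {"Reading", "Film", "Painting", "Swimming"}
U_B = {"Reading", "Film", "Painting", "Singing"}
U_C = {"Reading", "TV games"}
print(jaccard(U_A, U_B))   # 3/5 = 0.6
print(jaccard(U_A, U_C))   # 1/5 = 0.2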
Similarity Jaccard similarity special issues $J(U_A, U_B) = \frac{|\{Reading, Film\}|}{|\{Reading, Film, Painting, Swimming, Singing\}|} = \frac{2}{5}$ vs. $J(U_A, U_B) = \frac{|\{Reading, Film\}|}{|\{Reading, Film, Painting, Swimming, Singing, TV games\}|} = \frac{2}{6} = \frac{1}{3}$ 2 possible choices of denominator: 1) the union of the 2 users' interests; or 2) all possible interests? PS: if the former one is used, some information may be lost 36
Similarity Cosine similarity $C(U_i, U_j) = \frac{U_i \cdot U_j}{\|U_i\| \, \|U_j\|}$ To quantify the similarity between two sets (here, frequency vectors). $0 \le C(U_i, U_j) \le 1$: 1: the 2 sets are identical, 0: the 2 sets have no common elements. 37
Similarity Cosine similarity example Restaurant visit freq.: User A U_A = {LG1:3, Café:5, McDonalds:5} User B U_B = {LG1:2, Café:6, McDonalds:4} User C U_C = {LG1:10, Café:1, McDonalds:0} $C(U_A, U_B) = \frac{3 \cdot 2 + 5 \cdot 6 + 5 \cdot 4}{\sqrt{3^2+5^2+5^2}\,\sqrt{2^2+6^2+4^2}} = \frac{56}{\sqrt{59}\sqrt{56}} = 0.97$ $C(U_A, U_C) = \frac{3 \cdot 10 + 5 \cdot 1 + 5 \cdot 0}{\sqrt{3^2+5^2+5^2}\,\sqrt{10^2+1^2+0^2}} = \frac{35}{\sqrt{59}\sqrt{101}} = 0.45$ 38
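The same numbers reproduced with a short Python sketch (vectors ordered as LG1, Café, McDonalds):
import math
def cosine(u, v):
    """Dot product of the two frequency vectors over the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
U_A = [3, 5, 5]    # LG1, Café, McDonalds visit frequencies
U_B = [2, 6, 4]
U_C = [10, 1, 0]
print(round(cosine(U_A, U_B), 2))   # 0.97
print(round(cosine(U_A, U_C), 2))   # 0.45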
Tie Strength Weak or Strong? Now, connections (links) are not all of the same strength Interpersonal social networks in real life: Strong ties (close friends) Weak ties (acquaintances) Community formation and information diffusion Strength of Weak Ties (Granovetter, 1973): occasional encounters with distant acquaintances provide new opportunities in job searches 39
Weak and Strong Ties 40
How does strength of a tie influence diffusion? M. S. Granovetter: The Strength of Weak Ties, AJS, 1973: Finding a job through a contact whom one sees: Frequently (2+ times/week): 16.7% Occasionally (more than once a year but < 2 times/week): 55.6% Rarely: 27.8% But the length of the path is short: the contact directly works for / is the employer, or is connected directly to the employer PS: Any real-life experience? 41
Zachary's Karate Club Dataset Zachary's Karate Club is a dataset of social relationships collected by Wayne W. Zachary in his paper "An Information Flow Model for Conflict and Fission in Small Groups" 34 nodes representing the members of the club 77 edges representing the friendships between the members src: http://networkdata.ics.uci.edu/netdata/html/zacharykarate.html 42
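The dataset also ships with networkx, so a quick hedged sketch to explore it (note that published copies of the dataset differ slightly in edge count):
import networkx as nx
G = nx.karate_club_graph()                        # Zachary's karate club, bundled with networkx
print(G.number_of_nodes(), G.number_of_edges())   # 34 nodes, 78 edges in this copy
# Rank members by betweenness to spot the bridging individuals
bc = nx.betweenness_centrality(G)
top = sorted(bc, key=bc.get, reverse=True)[:3]
print(top)   # the instructor (node 0) and the administrator (node 33) typically rank highest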
Out-class activity 3 (due before Tutorial #3) Read the paper "An Information Flow Model for Conflict and Fission in Small Groups" (3400+ citations in 2018) http://course.ece.ust.hk/elec6910q/referencepaper/an_information_flow_model_for_Conflict_and_Fission.pdf 1. 3 points about their contributions 2. 3 possible extensions with what we learnt from the course 3. Submit by Facebook post before tomorrow noon 43
End of Lecture Questions / Comments? 44
Recommendation 45
Types of Recommendations 1. Image (Flickr) 2. Video (YouTube, Youku, Netflix) 3. Cuisine (Openrice, Dianping) 4. Friend/Member/Articles (Facebook, Renren, WeChat, Line, etc.) 5. Webpage/ bookmarks (Delicious) 6. Product (ebay, Amazon) 46
Recommendation Inputs 1. When users' interests/preferences are specified by the users, recommend by those criteria. 2. Recommend through social data, history, and behavioral data with machine learning and data mining techniques. Netflix recommendation system example: https://www.youtube.com/watch?v=nq2qtatuf7u 47
Common Techniques in Social Networks 1. Collaborative filtering (CF) Understand user properties for recommendation e.g., tagging for user generated content 2. K-NN based recommendation Understand the item and user properties for recommendation Similarities among items and users are calculated 48
Collaborative Filtering 49
Collaborative Filtering (CF) 1. The most prominent approach used by large, commercial e-commerce sites well-understood, various algorithms and variations exist applicable in many domains (books, movies...) 2. Basic assumption and idea: customers' tastes do not change much with time 50
Collaborative Filtering (CF) Leveraging similarity How it works 1. Should item 1 be recommended to Tim, based on the user-item matrix? 2. 2 approaches: user to user (calculate user similarity) and item to item (calculate item similarity) 51
Collaborative Filtering (CF) User-to-user Finding similar users (also similar tastes) e.g., Jaccard similarity $J(U_i, U_j) = \frac{|U_i \cap U_j|}{|U_i \cup U_j|}$ Jane and Tim both liked item 2 and disliked item 3, so they have similar tastes Item 1 is recommended to Tim (item 1 is liked by Jane) 52
Collaborative Filtering (CF) User-to-user User-based Nearest Neighbor Neighbors = similar users Generate a prediction for an item i by analyzing ratings for i from users in u's neighborhood: $\mathrm{pred}(u,i) = \bar{r}_u + \frac{\sum_{v \in \mathrm{neighbors}(u)} \mathrm{sim}(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in \mathrm{neighbors}(u)} \mathrm{sim}(u,v)}$ 53
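A compact Python sketch of this prediction; the toy ratings, the cosine user similarity, and taking all other users who rated the item as the neighborhood are assumptions for illustration only.
import math
ratings = {                      # toy user-item ratings (assumed data)
    "Jane": {"i1": 5, "i2": 4, "i3": 1},
    "Don":  {"i1": 4, "i2": 5, "i3": 2},
    "Tim":  {"i2": 5, "i3": 1},
}
def sim(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)
def predict(u, i):
    """pred(u,i) = mean(u) + weighted sum of neighbors' deviations from their means."""
    r_u = sum(ratings[u].values()) / len(ratings[u])
    neighbors = [v for v in ratings if v != u and i in ratings[v]]
    num = sum(sim(u, v) * (ratings[v][i] - sum(ratings[v].values()) / len(ratings[v]))
              for v in neighbors)
    den = sum(sim(u, v) for v in neighbors)
    return r_u if den == 0 else r_u + num / den
print(predict("Tim", "i1"))      # predicted rating of item 1 for Tim (~4.0 with this toy data)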
Collaborative Filtering (CF) Item-to-item Finding items that have similar subscribers Dom and Sandra are 2 users who both like items 1 & 4 Users who like item 4 also like item 1, so item 1 will be recommended to Tim. 54
Collaborative Filtering (CF) Item-to-item Item-Based Nearest Neighbor Generate predictions based on similarities between items. The prediction for a user u and item i is a weighted sum of user u's ratings for the items most similar to i: $\mathrm{pred}(u,i) = \frac{\sum_{j \in \mathrm{ratedItems}(u)} \mathrm{sim}(i,j)\, r_{uj}}{\sum_{j \in \mathrm{ratedItems}(u)} \mathrm{sim}(i,j)}$ 55
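And the item-based counterpart, again as a hedged sketch with assumed toy data and cosine similarity between item rating vectors:
import math
# Rows = items, columns = users (assumed toy ratings)
item_ratings = {
    "i1": {"Dom": 5, "Sandra": 4},
    "i2": {"Dom": 4, "Sandra": 5, "Tim": 5},
    "i4": {"Dom": 5, "Sandra": 5, "Tim": 4},
}
def item_sim(i, j):
    """Cosine similarity between the two items' rating vectors over common users."""
    common = set(item_ratings[i]) & set(item_ratings[j])
    if not common:
        return 0.0
    dot = sum(item_ratings[i][u] * item_ratings[j][u] for u in common)
    ni = math.sqrt(sum(item_ratings[i][u] ** 2 for u in common))
    nj = math.sqrt(sum(item_ratings[j][u] ** 2 for u in common))
    return dot / (ni * nj)
def predict(u, i):
    """pred(u,i) = sum_j sim(i,j) * r_uj / sum_j sim(i,j), over items j rated by u."""
    rated = [j for j in item_ratings if j != i and u in item_ratings[j]]
    num = sum(item_sim(i, j) * item_ratings[j][u] for j in rated)
    den = sum(item_sim(i, j) for j in rated)
    return 0.0 if den == 0 else num / den
print(predict("Tim", "i1"))   # predicted rating of item 1 for Tim (~4.5 with this toy data)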
Example: Friendship Recommendation 1. Similarity among users can be found through the user-item matrix 2. Recommend Don to Jane (as an online friend), since they have the most similar tastes (common interests) (figure: user-item matrix for Jane, Tim, Don) 56
K-nearest neighbors (K-NN) 57
Recommendation as a Classification Problem 1. 2 classes: like or dislike 2. Recommendation: find items that will be liked 3. Example: which clothes will be liked? 58
K-nearest neighbors (K-NN) 1. The simplest machine learning algorithm for classification 2. Assign an object to the class most common among its k nearest neighbors via some voting mechanism 3. Different neighbors could have different weights e.g., the nearest one has a higher weight (by similarity) 59
K-NN: classifying a fish 2 classes: sea bass and salmon k = 3, (2 sea bass, 1 salmon) Classified as sea bass 3 classes: sea bass, salmon and eel k = 5, (3 sea bass, 1 eel, 1 salmon) Classified as sea bass 60
K-NN: An algorithm to find users/objects with similar tastes/subscribers Step 1: Collect a set of labeled samples (items already liked/disliked) Step 2: Find the k nearest neighbors, the k items with the highest similarity Step 3: Classify the input item, e.g., like or dislike If K=3, then in this case the query instance will be classified as positive, since 2 of the 3 nearest neighbors are positive (figure: Tim's liked/disliked items around the query item) 61
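A short hand-rolled Python sketch of these three steps; the Jaccard similarity measure and the toy samples are assumptions for illustration (in practice scikit-learn's KNeighborsClassifier plays the same role).
from collections import Counter
def knn_classify(query, samples, k=3):
    """samples: list of (feature_set, label). Majority vote among the k most similar."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    ranked = sorted(samples, key=lambda s: jaccard(query, s[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
# Step 1: labeled samples (items Tim already liked/disliked) - assumed toy data
samples = [
    ({"casual", "blue"}, "liked"),
    ({"casual", "red"}, "liked"),
    ({"formal", "black"}, "disliked"),
    ({"formal", "grey"}, "disliked"),
]
# Steps 2-3: find the k most similar items, then vote
print(knn_classify({"casual", "black"}, samples, k=3))   # -> "liked" (2 of 3 neighbors)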