Idea: Select and rank nodes w.r.t. their relevance or interestingness in large networks.

Similar documents
CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Link Analysis Ranking

Online Social Networks and Media. Link Analysis and Web Search

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

CS6220: DATA MINING TECHNIQUES

Link Analysis. Leonid E. Zhukov

Online Social Networks and Media. Link Analysis and Web Search

Centrality measures in graphs or social networks

CS6220: DATA MINING TECHNIQUES

CS249: ADVANCED DATA MINING

Data Mining Recitation Notes Week 3

Data Mining and Matrices

Complex Networks CSYS/MATH 303, Spring, Prof. Peter Dodds

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Cross-lingual and temporal Wikipedia analysis

Inf 2B: Ranking Queries on the WWW

1 Searching the World Wide Web

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Introduction to Data Mining

Slides based on those in:

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Google Page Rank Project Linear Algebra Summer 2012

Complex Social System, Elections. Introduction to Network Analysis 1

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

ECEN 689 Special Topics in Data Science for Communications Networks

Node Centrality and Ranking on Networks

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Web Structure Mining Nodes, Links and Influence

STA141C: Big Data & High Performance Statistical Computing

CS47300: Web Information Search and Management

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

0.1 Naive formulation of PageRank

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

1998: enter Link Analysis

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Faloutsos, Tong ICDE, 2009

Comparative Network Analysis

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

Lecture 12: Link Analysis for Web Retrieval

IR: Information Retrieval

Probability & Computing

Applications to network analysis: Eigenvector centrality indices Lecture notes

Computing PageRank using Power Extrapolation

Overlapping Communities

Page rank computation HPC course project a.y

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee

A Note on Google s PageRank

Link Mining PageRank. From Stanford C246

Node and Link Analysis

Data Mining Techniques

Powerful tool for sampling from complicated distributions. Many use Markov chains to model events that arise in nature.

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

The Google Markov Chain: convergence speed and eigenvalues

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Analysis of an Optimal Measurement Index Based on the Complex Network

The Dynamic Absorbing Model for the Web

Data and Algorithms of the Web

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

Kristina Lerman USC Information Sciences Institute

Model Reduction for Edge-Weighted Personalized PageRank

Online Sampling of High Centrality Individuals in Social Networks

Algebraic Representation of Networks

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Results: MCMC Dancers, q=10, n=500

PROBABILISTIC LATENT SEMANTIC ANALYSIS

A New Space for Comparing Graphs

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

On Top-k Structural. Similarity Search. Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada

Edge-Weighted Personalized PageRank: Breaking a Decade-Old Performance Barrier

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

How does Google rank webpages?

Data Mining and Analysis: Fundamental Concepts and Algorithms

Discovering Contexts and Contextual Outliers Using Random Walks in Graphs

Finding central nodes in large networks

Scaling Neighbourhood Methods

LAPLACIAN MATRIX AND APPLICATIONS

Linear Classifiers (Kernels)

Scribes: Po-Hsuan Wei, William Kuzmaul Editor: Kevin Wu Date: October 18, 2016

arxiv:cond-mat/ v1 3 Sep 2004

Stat 315c: Introduction

Transcription:

Graph Databases and Linked Data So far: Obects are considered as iid (independent and identical diributed) the meaning of obects depends exclusively on the description obects do not influence each other In the following: Link Mining Obects are connected and dependent. Examples: Publications are measures based on citations. obects might depend on any connected obect databases become large networks (knowledge graphs) 44 Node Ranking Idea: Select and rank nodes w.r.t. their relevance or intereingness in large networks. Intereingness might depend on : influence to the complete networks key nodes for network flows Applications: Ranking web sites and web pages Rank researchers in citation networks Rank importance of nodes representing crossing or routers in transportation networks 45

Centrality Measures Idea: Centrality depends on the position of a node to the other nodes w.r.t. networks diance (=co optimal path between two nodes) Let d(v,t) be the length of the shorte path from v to t (v,t V) in G(V,E): Closeness Centrality: CC v d( v, t) Graph Centrality: C G v tv max( d( v, t)) Let be the number of shorte paths from s to t and let (v) be the number of shorte path from s to t containing v. tv Stress Centrality: C S v svtv v Betweenness Centrality: C B ( v) svtv ( v) 46 Centrality Measures Example: Let nodes represent routers in a computer network. If the router having the highe betweenness centrality goes offline the mo direct connections are affected. Computation: Set of all pair shorte paths can be computed in O(n 3 ) time and using O(n 2 ) memory by the Floyd Warshal algorithm. theorem: v is on the shorte path between s and t if and only if d(s,t) = d(s,v)+ d(v,t) v 0 sv vt d( s, t) d( s, v) d( v, t) else to compute the betweenness centrality it is not necessary to compute all paths there are faer solution: O(nm) without edge weights O(nm+n 2 log n) in graphs having edge weights where n = V and m = E in the graph G(V,E) if 47

Computing Betweeness Centrality Basic idea: Start a single source all target search from each node s. The result is a tree (called Dikra tree) containing all shorte paths arting with s. The Dikra tree also induces a diance ranking of all nodes to s. Visit each node v with descending diance to s and count all nodes t lying behind v in the tree ( (v)) and the set of shorte paths from s to t ( ) s v t o t 0 (v)= (v)= t 0 0 =2 48 Algorithm for unweighted graphs() Variables and expressions: S: Stack oring nodes w.r.t to their diance to s Q: Priority Queue for the Dikra search (ordered by the diance to s) P[: Li oring all predecessors of v d[: diance of the shorte path from s to v [: number of shorte paths from s to v [ [: Given then [ s( v) sv [ [ ( w) w: vp[ w] Workflow for each arting node s:. Phase: Algorithm computes the Dikra tree of s 2. Phase: traverse ack S and count the number of nodes behind each visited node v tv sw s 49

Algorithmus für ungewichtet Graphen(2) CB[ := 0 vv for s V S:= empty Stack; P[w] := empty Li wv; [t] :=0 tv; [s]:=; d[t] :=- tv; d[s]:0; Q := empty Queue; Q.push(0,s); while Q not empty do v := Q.pop(); S.push(v); foreach neighbor w of v do if d[w] < 0 then d[w]:=d[+; Q.push(d[w],w); end if if d[w]=d[+ then (w):=(w)+(v) P[w].add(v) end if end for end while [:=0; vv; while S not empty do w:=s.pop(); for vp[w] do [ [ : [ [ w] ; [ w] end for if ws then CB[w]:=CB[w]+[w]; end if end while end for 50 Ranking nodes in hyperlinked Text PageRank: (S.Brin/B. Page 996) important component in ranking algorithms of search engines (in combination with other features) Data is considered a rongly connected, directed network G(V,E). ( e.g. all HTML documents in a search engine) probabiliic surfer performs an infinite random walk. idea: visiting probability = importance of the page v. 5

Anwendungen im Web Mining Computing the PageRank art diribution: p0(u) = / V adacency matrix: E transition prob.: L[ u, probability of page v at time i: diribution vector over all pages: E[ u, E[ u, ] 0 E 0 0 L 0 0 2 0 0 0 2 0 0 p ( u) p i[ L[ u, pi uv T i L pi A C B T Computation by Power Iterations : pi L pi after ca. 20 30 iterations result should be able Solution for none rongly connected graphs:. Remove nodes without outlink 2. Allow umps during traversal 52 Ranking Linked Obects HITS (Kleinberg 998): Hyperlink Induced Topic Search Consider only obects being relevant for q or being linked to relevant pages (in and outlinks). G q (V q,e q ) for query q there are two types of obects : Hubs: link to relevant obects (authorities) Authorities: relevant obects being linked by hubs. => each obect has an authority score and a hub score for each obect u, h[u] denotes its hub score and a[u] its authority score. a good authority is linked by many good hubs a good hubs links to many good authorities. 53

Anwendungen im Web Mining Computing HITS: a vector of authority scores over all obects v V q h vector of hub scores over all obects v V q Computation by mutual iterations: Complete algorithm:. determine relevant obects (root set). a E h Ea (hub score) 2. determine all pages linking relevant obects.(extended set) 3. iterate over all hub and authority scores 4. Order the relevant pages by the authority scores T h (authority score) 54 Link Prediction Input: A graph G(V,E) and 2 nodes v,u V where (v,u) E. Output: Predict the exience of link (v,u) if: the exience is unknown. the link might develop at a future point in time Examples: Links in social networks unknown protein interaction Cuomer product recommendations in bipartite graphs (Collaborative Filtering) 55

Feature based Link Prediction Idea: Use the features of pairs of obects to describe their relationship. Example: Common interes in social networks Co authors do research in the same area proteins have complementary active regions Links do develop by accident, there are reasons which might be found in the feature values Link Prediction: Learn a classifier that maps pairs of feature descriptions to link probabilities Formal: Let u,v V and let F(v),F(u) be their feature descriptions. Then, Link Prediction is the task to learn a function P: (F(v),F(u)) > L. (L is either discrete {link, no link} or real valued [0,..max_Strength]) 56 Topology Based Linke Prediction Problem: Feature based approach do not consider network proximity. Example: Persons having similar interes might not have any contact Proteins might dock but do not appear in the same natural surrounding Solution: Integrate the neighborhood of v and u in G. common neighbors increase the likelihood of a link describe a node by its adacency li or the subnetwork being influenced by the node 57

Link Prediction and Matrix Factorisation Input: Graph G(V,E) with adacency matrix A and let E u E be the set of links with unknown exience or rength. Method: Factorizing A allows to find a latent k dimensional space (k is the rank of A) (Factorization can be done regardless of missing entries) nodes can be expressed in this latent space remapping of the nodes to the V dimensional space fills up the unknown entries E u. Vorgehen: Factorize A in the nkmatrix B while minimizing L(B) the : L( B) a i, a A\ U a 2 a Computation: Gradient descent on the derivate of L(B). Remark: Also applicable to bipartite graphs (cuomer/ product) A\ U a b *, b *, 2 T A BB 58 Dense Subgraph Discovery Find dense subgraphs in a network G(V,E). Definitions of dense : cliques (complete subgraphs) quasi cliques (at lea x % of the edges mu exi) relative density of the surrounding: in node in subgraph G has more links to other node from G than to nodes G \G. Problem: almo all definitions lead to NP hard search problems => heuriic solutions => practical use is limited 59

Graph Cluering class of cluering methods that treat the data set as graph Obect= node; links diance, similarity, reachability diance... usually: only consider the k neare neighbors or an range => directed and undirected network are considered Cluering by weighted k mincut: Partition a graph G into k disunctive subgraphs having similar size while minimizing the number of removed edges. => Weighted k mincut is also an NP hard problem. 60 Graph Cluering: Spectral Cluering built a symmetric adacency matrix S: S sim ( x x ) Transform S into a graph Laplacian matrix L: L= I - D - 2 D - 2 after eigenvalue decomposition of L: S D sim x k 0 ( else i Eigenvectors with eigenvalues = 0, represent connected components Eigenvectors describe the linear weights to represent a cluer representative DB r k i EV i o i x k ) if 6

Conclusions Graph Mining Graph Mining includes new data mining tasks Ranking nodes Link prediction Dense subgraph discovery and community detection Frequent Subgraph Mining Cluering can be formulated as a graph problem Density based cluering: find all connected components where links denote a similarity predicate Spectral cluering weighted k mincut: Partition a graph into k subgraphs while minimizing the weights of the cut edges under size conraints w.r.t. the result subgraphs. 62 Literature Borgwardt K., Kriegel H. P.: Shorte path kernels on graphs. In Proc. Intl. Conf. Data Mining (ICDM 2005), 2005 Borgwardt K.: Graph Kernels, Dissertation im Fach Informatik, Ludwig Maximilians Universität München, 2007 Bunke, H. : Recent developments in graph matching. In ICPR, pages 27 224. 2000 Gärtner, T., Flach, P., and Wrobel: On graph kernels: Hardness results and efficient alternatives. Proc. Annual Conf. Computational Learning Theory, pages 29 43, 2003 Wiener, H.: Structural determination of paraffin boiling points. J. Am. Chem. Soc., 69():7 20, 947 63

Literature Yan X., Han J.: gspan: Graph based subructure pattern mining, In ICDM, 2002. Kuramochi M., Karypis G.: Frequent Subgraph Discovery, In ICDM, 200 Brin S., Page L.: The anatomy of a large scale hypertextual Web search engine, Computer Networks and ISDN Syems, Vol 30, Nr. 7, S.07 7,998 Kleinberg J. M.: Authoritative sources in a hyperlinked environment, Journal of the ACM,Vol.46, Nr. 5, S. 604 632,999 Brandes U.: A faer Algorithm for Betweenness Centrality, Journal of Mathematical Sociology, 25(2):63 77, 200 64