Graph Databases and Linked Data
So far: objects are considered as i.i.d. (independent and identically distributed); the meaning of objects depends exclusively on the description; objects do not influence each other.
In the following: Link Mining. Objects are connected and dependent. Examples: publications are measured based on citations; objects might depend on any connected object; databases become large networks (knowledge graphs). 44
Node Ranking
Idea: select and rank nodes w.r.t. their relevance or interestingness in large networks. Interestingness might depend on: influence on the complete network; key nodes for network flows.
Applications: ranking web sites and web pages; ranking researchers in citation networks; ranking the importance of nodes representing crossings or routers in transportation networks. 45
Centrality Measures
Idea: centrality depends on the position of a node relative to the other nodes w.r.t. network distance (= cost of an optimal path between two nodes).
Let d(v,t) be the length of the shortest path from v to t (v,t ∈ V) in G(V,E):
Closeness Centrality: C_C(v) = 1 / Σ_{t≠v} d(v,t)
Graph Centrality: C_G(v) = 1 / max_{t≠v} d(v,t)
Let σ_st be the number of shortest paths from s to t and let σ_st(v) be the number of shortest paths from s to t containing v.
Stress Centrality: C_S(v) = Σ_{s≠v≠t} σ_st(v)
Betweenness Centrality: C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st 46
Centrality Measures
Example: let nodes represent routers in a computer network. If the router having the highest betweenness centrality goes offline, the most direct connections are affected.
Computation: the set of all-pairs shortest paths can be computed in O(n³) time using O(n²) memory by the Floyd-Warshall algorithm.
Theorem: v is on a shortest path between s and t if and only if d(s,t) = d(s,v) + d(v,t). Therefore:
σ_st(v) = σ_sv · σ_vt if d(s,t) = d(s,v) + d(v,t), and σ_st(v) = 0 else.
To compute the betweenness centrality it is not necessary to compute all paths; there are faster solutions: O(nm) without edge weights, O(nm + n² log n) in graphs having edge weights, where n = |V| and m = |E| in the graph G(V,E). 47
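As a minimal sketch of the distance-based definitions above, the following pure-Python example computes closeness and graph centrality on an unweighted toy graph via BFS (the graph, node labels, and function names are illustrative, not from the slides):

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path lengths d(source, t) in an unweighted graph (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness_centrality(graph, v):
    """C_C(v) = 1 / sum of distances from v to all other nodes."""
    dist = bfs_distances(graph, v)
    return 1.0 / sum(d for t, d in dist.items() if t != v)

def graph_centrality(graph, v):
    """C_G(v) = 1 / maximum distance (eccentricity) of v."""
    dist = bfs_distances(graph, v)
    return 1.0 / max(dist.values())

# Toy path graph a - b - c - d - e: the middle node c is the most central.
G = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(closeness_centrality(G, "c"))  # 1/(2+1+1+2) = 1/6
print(graph_centrality(G, "c"))      # 1/2
```

On the path graph, the endpoint "a" gets the lower closeness score 1/(1+2+3+4) = 1/10, matching the intuition that central positions minimize total distance.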
Computing Betweenness Centrality
Basic idea: start a single-source all-targets search from each node s. The result is a tree (called the Dijkstra tree) containing all shortest paths starting at s. The Dijkstra tree also induces a distance ranking of all nodes relative to s.
Visit each node v in order of descending distance to s and count, for all nodes t lying behind v in the tree, the number of shortest paths from s to t through v (σ_st(v)) and the total number of shortest paths from s to t (σ_st).
(Figure: an example Dijkstra tree with nodes s, v, t illustrating the σ counts, e.g. σ_st = 2.) 48
Algorithm for unweighted graphs (1)
Variables and expressions:
S: stack storing nodes w.r.t. their distance to s
Q: priority queue for the Dijkstra search (ordered by the distance to s)
P[v]: list storing all predecessors of v
d[v]: distance of the shortest path from s to v
σ[v]: number of shortest paths from s to v
δ[v]: dependency of s on v: δ_s(v) = Σ_{w: v ∈ P[w]} (σ_sv / σ_sw) · (1 + δ_s(w))
Workflow for each starting node s:
1. Phase: the algorithm computes the Dijkstra tree of s.
2. Phase: traverse the stack S and count the number of nodes behind each visited node v. 49
Algorithm for unweighted graphs (2)
C_B[v] := 0 for all v ∈ V;
for each s ∈ V do
  S := empty stack; P[w] := empty list for all w ∈ V;
  σ[t] := 0 for all t ∈ V; σ[s] := 1;
  d[t] := -1 for all t ∈ V; d[s] := 0;
  Q := empty queue; Q.push(s);
  while Q not empty do
    v := Q.pop(); S.push(v);
    for each neighbor w of v do
      if d[w] < 0 then d[w] := d[v] + 1; Q.push(w); end if
      if d[w] = d[v] + 1 then σ[w] := σ[w] + σ[v]; P[w].add(v); end if
    end for
  end while
  δ[v] := 0 for all v ∈ V;
  while S not empty do
    w := S.pop();
    for v ∈ P[w] do δ[v] := δ[v] + (σ[v] / σ[w]) · (1 + δ[w]); end for
    if w ≠ s then C_B[w] := C_B[w] + δ[w]; end if
  end while
end for 50
Ranking nodes in hyperlinked text
PageRank (S. Brin / L. Page, 1998): an important component in the ranking algorithms of search engines (in combination with other features). The data is considered a strongly connected, directed network G(V,E) (e.g., all HTML documents in a search engine). A probabilistic surfer performs an infinite random walk. Idea: visiting probability = importance of the page v. 51
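The two-phase pseudocode for unweighted graphs translates almost line by line into Python; this sketch uses a plain FIFO queue, which is sufficient for BFS on unweighted graphs (the example graph is made up for illustration):

```python
from collections import deque

def betweenness_centrality(graph):
    """Brandes-style betweenness centrality for unweighted graphs."""
    CB = {v: 0.0 for v in graph}
    for s in graph:
        # Phase 1: BFS from s, recording predecessors and shortest-path counts.
        S = []                                     # nodes in order of non-decreasing distance
        P = {w: [] for w in graph}                 # predecessor lists
        sigma = {t: 0 for t in graph}; sigma[s] = 1
        d = {t: -1 for t in graph}; d[s] = 0
        Q = deque([s])
        while Q:
            v = Q.popleft()
            S.append(v)
            for w in graph[v]:
                if d[w] < 0:                       # w found for the first time
                    d[w] = d[v] + 1
                    Q.append(w)
                if d[w] == d[v] + 1:               # edge (v, w) lies on a shortest path
                    sigma[w] += sigma[v]
                    P[w].append(v)
        # Phase 2: pop the stack (descending distance) and accumulate dependencies.
        delta = {v: 0.0 for v in graph}
        while S:
            w = S.pop()
            for v in P[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                CB[w] += delta[w]
        # (for undirected graphs, halve all scores at the end if desired)
    return CB

# Path graph a - b - c: only b lies between the other two nodes.
G = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_centrality(G))  # b gets 2.0 (both directions counted), a and c get 0.0
```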
Applications in Web Mining
Computing the PageRank:
start distribution: p_0(u) = 1/|V|
adjacency matrix: E
transition probabilities: L[u,v] = E[u,v] / Σ_w E[u,w]
probability of page v at time i+1: p_{i+1}(v) = Σ_u L[u,v] · p_i(u)
distribution vector over all pages: p_{i+1} = L^T · p_i
(Figure: a three-node example graph A, B, C with its adjacency matrix E and transition matrix L, where rows with two outlinks have entries 1/2.)
Computation by power iterations: p_{i+1} = L^T · p_i; after ca. 20-30 iterations the result should be stable.
Solution for non-strongly-connected graphs: 1. remove nodes without outlinks; 2. allow jumps during traversal. 52
Ranking Linked Objects
HITS (Kleinberg, 1998): Hyperlink-Induced Topic Search. Consider only objects being relevant for query q or being linked to relevant pages (in- and outlinks): G_q(V_q, E_q) for query q. There are two types of objects:
Hubs: link to relevant objects (authorities).
Authorities: relevant objects being linked by hubs.
=> each object has an authority score and a hub score; for each object u, h[u] denotes its hub score and a[u] its authority score. A good authority is linked by many good hubs; a good hub links to many good authorities. 53
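The PageRank power iteration can be sketched in pure Python. The three-node example graph is illustrative, and no damping is used since the slide assumes a strongly connected network:

```python
def pagerank(links, iterations=30):
    """Power iteration p_{i+1} = L^T p_i on a strongly connected link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    nodes = list(links)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}          # start distribution p_0(u) = 1/|V|
    for _ in range(iterations):
        p_next = {u: 0.0 for u in nodes}
        for u in nodes:
            out = links[u]
            share = p[u] / len(out)          # L[u, v] = E[u, v] / outdegree(u)
            for v in out:
                p_next[v] += share
        p = p_next
    return p

# Strongly connected, aperiodic 3-node example: A -> B, B -> C, C -> A and C -> B.
G = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
ranks = pagerank(G)
print(ranks)  # approaches the stationary distribution A: 0.2, B: 0.4, C: 0.4
```

Solving the stationary equations by hand (p(A) = p(C)/2, p(B) = p(A) + p(C)/2, p(C) = p(B)) confirms the limit (0.2, 0.4, 0.4), and 30 iterations are indeed enough here.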
Applications in Web Mining
Computing HITS:
a: vector of authority scores over all objects v ∈ V_q
h: vector of hub scores over all objects v ∈ V_q
Computation by mutual iterations: a = E^T · h (authority score), h = E · a (hub score).
Complete algorithm: 1. determine relevant objects (root set); 2. determine all pages linking relevant objects (extended set); 3. iterate over all hub and authority scores; 4. order the relevant pages by their authority scores. 54
Link Prediction
Input: a graph G(V,E) and two nodes v,u ∈ V where (v,u) ∉ E.
Output: predict the existence of link (v,u) if: the existence is unknown; the link might develop at a future point in time.
Examples: links in social networks; unknown protein interactions; customer-product recommendations in bipartite graphs (collaborative filtering). 55
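The mutual iteration a = E^T h, h = E a can be sketched as follows; the toy link graph and the per-round rescaling (so scores stay bounded) are illustrative choices, not prescribed by the slides:

```python
def hits(links, iterations=50):
    """Mutual iteration of authority and hub scores: a = E^T h, h = E a."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: a[v] = sum of hub scores of pages linking to v.
        a_new = {v: 0.0 for v in nodes}
        for u, targets in links.items():
            for v in targets:
                a_new[v] += h[u]
        # Hub update: h[u] = sum of authority scores of pages u links to.
        h_new = {v: 0.0 for v in nodes}
        for u, targets in links.items():
            for v in targets:
                h_new[u] += a_new[v]
        # Normalize so the scores do not diverge.
        na = sum(a_new.values()); nh = sum(h_new.values())
        a = {v: s / na for v, s in a_new.items()}
        h = {v: s / nh for v, s in h_new.items()}
    return a, h

# Toy extended set: hubs h1 and h2 both link to the page "auth".
G = {"h1": ["auth", "other"], "h2": ["auth"], "auth": [], "other": []}
a, h = hits(G)
print(max(a, key=a.get))  # "auth": linked by both hubs, highest authority score
print(max(h, key=h.get))  # "h1": links to two authorities, highest hub score
```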
Feature-based Link Prediction
Idea: use the features of pairs of objects to describe their relationship. Examples: common interests in social networks; co-authors do research in the same area; proteins have complementary active regions. Links do not develop by accident; there are reasons which might be found in the feature values.
Link Prediction: learn a classifier that maps pairs of feature descriptions to link probabilities.
Formally: let u,v ∈ V and let F(v), F(u) be their feature descriptions. Then link prediction is the task of learning a function P: (F(v), F(u)) -> L. (L is either discrete {link, no link} or real-valued [0, ..., max_strength].) 56
Topology-based Link Prediction
Problem: feature-based approaches do not consider network proximity. Examples: persons having similar interests might not have any contact; proteins might dock but not appear in the same natural surroundings.
Solution: integrate the neighborhood of v and u in G: common neighbors increase the likelihood of a link; describe a node by its adjacency list or the subnetwork being influenced by the node. 57
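As an illustration of the topology-based idea, here is a common-neighbors score, one of the simplest proximity measures, together with its Jaccard-normalized variant; the toy social graph is made up for the example:

```python
def common_neighbors(graph, u, v):
    """Number of shared neighbors of u and v: a simple link-prediction score."""
    return len(set(graph[u]) & set(graph[v]))

def jaccard_score(graph, u, v):
    """Common neighbors normalized by the size of the neighborhood union."""
    nu, nv = set(graph[u]), set(graph[v])
    return len(nu & nv) / len(nu | nv) if nu | nv else 0.0

# Toy social network: ann and bob share two friends, ann and eve share none.
G = {
    "ann": ["carl", "dora"],
    "bob": ["carl", "dora", "eve"],
    "carl": ["ann", "bob"],
    "dora": ["ann", "bob"],
    "eve": ["bob"],
}
print(common_neighbors(G, "ann", "bob"))  # 2 -> a link (ann, bob) is plausible
print(jaccard_score(G, "ann", "eve"))     # 0.0 -> no topological evidence for a link
```

Such scores can also serve as input features for the classifier-based formulation above, combining feature-based and topology-based information.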
Link Prediction and Matrix Factorization
Input: a graph G(V,E) with adjacency matrix A, and let E_u ⊆ E be the set of links with unknown existence or strength.
Method: factorizing A allows finding a latent k-dimensional space (k is the rank of A); factorization can be done regardless of missing entries; nodes can be expressed in this latent space; remapping the nodes to the |V|-dimensional space fills up the unknown entries E_u.
Approach: factorize A into the n×k matrix B while minimizing the loss L(B):
L(B) = Σ_{(i,j) ∈ A\E_u} (a_{i,j} - b_{i,*} · b_{j,*}^T)², i.e., A ≈ B·B^T on the known entries.
Computation: gradient descent on the derivative of L(B).
Remark: also applicable to bipartite graphs (customer/product). 58
Dense Subgraph Discovery
Find dense subgraphs in a network G(V,E). Definitions of "dense": cliques (complete subgraphs); quasi-cliques (at least x% of the edges must exist); relative density w.r.t. the surrounding: each node in subgraph G' has more links to other nodes from G' than to nodes in G\G'.
Problem: almost all definitions lead to NP-hard search problems => heuristic solutions => practical use is limited. 59
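A minimal gradient-descent sketch of the factorization idea, assuming a tiny symmetric adjacency matrix with one unknown entry (the matrix, rank k, learning rate, and iteration count are all illustrative choices):

```python
import random

def factorize(A, known, k=2, steps=2000, lr=0.05):
    """Minimize L(B) = sum over known (i,j) of (A[i][j] - b_i . b_j)^2 by gradient descent."""
    n = len(A)
    rng = random.Random(0)                    # fixed seed for reproducibility
    B = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    for _ in range(steps):
        for i, j in known:
            pred = sum(B[i][f] * B[j][f] for f in range(k))
            err = A[i][j] - pred
            for f in range(k):
                bi, bj = B[i][f], B[j][f]
                B[i][f] += lr * err * bj      # gradient step on row b_i
                B[j][f] += lr * err * bi      # gradient step on row b_j
    return B

# 3x3 symmetric adjacency matrix; the entry (0, 2) belongs to E_u and is excluded.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
known = [(i, j) for i in range(3) for j in range(3) if (i, j) not in [(0, 2), (2, 0)]]
B = factorize(A, known)
pred = sum(B[0][f] * B[2][f] for f in range(2))
print(round(pred, 2))  # B B^T fills in the unknown entry; here it comes out near 1
```

Remapping via B·B^T predicts a value near 1 for the unknown entry: the only exact rank-2 completion of the known entries forces b_0 = b_1 = b_2, so the factorization suggests the missing link exists.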
Graph Clustering
A class of clustering methods that treat the data set as a graph. Object = node; links express distance, similarity, reachability distance, ... Usually only the k nearest neighbors or an ε-range are considered => both directed and undirected networks occur.
Clustering by weighted k-mincut: partition a graph G into k disjoint subgraphs of similar size while minimizing the number (weight) of removed edges. => Weighted k-mincut is also an NP-hard problem. 60
Graph Clustering: Spectral Clustering
Build a symmetric adjacency matrix S:
S[i,j] = sim(x_i, x_j) if x_j ∈ kNN(x_i), else 0.
Transform S into a graph Laplacian matrix L: L = I - D^{-1/2} S D^{-1/2} (D is the diagonal degree matrix).
After the eigenvalue decomposition of L: eigenvectors with eigenvalue 0 represent connected components; the eigenvectors describe the linear weights for representing a cluster representative r as a combination r = Σ_i o_i · EV_i. 61
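A small pure-Python check of the eigenvalue-0 property, assuming a toy similarity graph with two connected components: for each component, the vector with entries sqrt(deg(v)) on that component and 0 elsewhere satisfies L·v = 0, i.e., it is an eigenvector of L = I - D^{-1/2} S D^{-1/2} with eigenvalue 0 (the graph is made up for illustration):

```python
import math

# Toy similarity graph with two connected components: {0, 1, 2} and {3, 4}.
S = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
n = len(S)
deg = [sum(row) for row in S]

# Normalized graph Laplacian L = I - D^{-1/2} S D^{-1/2}.
L = [[(1.0 if i == j else 0.0) - S[i][j] / math.sqrt(deg[i] * deg[j])
      for j in range(n)] for i in range(n)]

# The component indicator, scaled by sqrt(deg), is an eigenvector with eigenvalue 0.
results = []
for component in [{0, 1, 2}, {3, 4}]:
    v = [math.sqrt(deg[i]) if i in component else 0.0 for i in range(n)]
    Lv = [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
    results.append(all(abs(x) < 1e-12 for x in Lv))
print(results)  # [True, True]: one zero eigenvalue per connected component
```

This is why counting the eigenvalue-0 eigenvectors of L recovers the connected components; for clusters that are only loosely connected, the corresponding eigenvalues are close to (but not exactly) 0.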
Conclusions Graph Mining
Graph mining includes new data mining tasks: ranking nodes; link prediction; dense subgraph discovery and community detection; frequent subgraph mining.
Clustering can be formulated as a graph problem: density-based clustering: find all connected components where links denote a similarity predicate; spectral clustering / weighted k-mincut: partition a graph into k subgraphs while minimizing the weights of the cut edges under size constraints w.r.t. the resulting subgraphs. 62
Literature
Borgwardt K., Kriegel H.-P.: Shortest-path kernels on graphs. In Proc. Int. Conf. on Data Mining (ICDM 2005), 2005.
Borgwardt K.: Graph Kernels. PhD thesis (Dissertation), Ludwig-Maximilians-Universität München, 2007.
Bunke H.: Recent developments in graph matching. In ICPR, pages 217-224, 2000.
Gärtner T., Flach P., Wrobel S.: On graph kernels: Hardness results and efficient alternatives. In Proc. Annual Conf. on Computational Learning Theory, pages 129-143, 2003.
Wiener H.: Structural determination of paraffin boiling points. J. Am. Chem. Soc., 69(1):17-20, 1947.
63
Literature
Yan X., Han J.: gSpan: Graph-based substructure pattern mining. In ICDM, 2002.
Kuramochi M., Karypis G.: Frequent subgraph discovery. In ICDM, 2001.
Brin S., Page L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
Kleinberg J. M.: Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
Brandes U.: A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163-177, 2001.
64