MTA SZTAKI Data Mining and Search Group June 14, 2013 Supported by the EC FET Open project New tools and algorithms for directed network analysis (NADINE No 288956)
Table of Contents
1 Link prediction on multilingual Wikipedia: Motivation; About SimRank; SimRank for multilingual Wikipedia; Link prediction
2 Temporal Wikipedia search by edits and linkage: Motivation; Selecting a temporally changing subgraph; Personalized PageRank and Personalized HITS
Section 1 Link prediction on multilingual Wikipedia
Multilingual Wikipedia
Wikipedia articles about the Erdős number in German, French and Hungarian
Multilingual graph model
Edge types:
links between articles
category-contains-article relationships
category-hierarchy links
interwiki links (between languages)
Statistics
3 languages: German, French, Hungarian; snapshot from March 2012
lang.  articles    categories
De     2 338 795   139 844
Fr     2 408 097   199 708
Hu       339 041    34 653
Parallel articles:   De-Fr 482 196, De-Hu 108 949, Fr-Hu 119 559
Parallel categories: De-Fr  22 175, De-Hu   4 840, Fr-Hu   5 387
Only a small fraction of pages has an equivalent version in another language; the category hierarchies are entirely different.
Applications, use cases
Motivation: cleansing and expanding a local Wikipedia:
new content from a bigger Wikipedia to a smaller one
more detailed content from a smaller, more specialized Wikipedia to the bigger one
Tag recommendation in similarly structured networks (LibraryThing, Amazon)
Link prediction
We were focusing on:
interwiki link recommendation for categories
category recommendation for articles
related entity recommendation for articles
Similar methods are used in each case:
1 Selecting candidates
2 Ranking candidates (with Jaccard, SimRank, etc.)
Basic SimRank equation
Two pages are similar if they are pointed to by similar pages (Jeh and Widom, KDD 2002). The similarity between objects a and b is sim(a, b) ∈ [0, 1]:
sim(a, b) = 1 if a = b,
sim(a, b) = (C / (|N(a)| |N(b)|)) · Σ_{i=1}^{|N(a)|} Σ_{j=1}^{|N(b)|} sim(N_i(a), N_j(b)) otherwise,
where N(a) denotes the in-neighbors of a. The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b; C, the decay factor, is a constant between 0 and 1.
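The recursion above can be computed by simple fixed-point iteration over all node pairs. A minimal sketch (the dictionary-based graph representation and function names are our own illustration, not from the slides):

```python
import itertools

def simrank(in_neighbors, C=0.8, iters=10):
    """Iterative SimRank on a graph given as {node: [in-neighbors]}."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a, b in itertools.product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            Na, Nb = in_neighbors[a], in_neighbors[b]
            if not Na or not Nb:
                new[(a, b)] = 0.0  # no in-neighbors: similarity stays 0
                continue
            total = sum(sim[(u, v)] for u in Na for v in Nb)
            new[(a, b)] = C * total / (len(Na) * len(Nb))
        sim = new
    return sim

# Tiny example: two pages pointed to by the same page are similar.
g = {"x": [], "a": ["x"], "b": ["x"]}
s = simrank(g)
```

Here sim(a, b) converges to C = 0.8, since the single in-neighbor of each page is identical.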
SimRank with random walks
The expected meeting distance is the expected time until two random surfers, starting from a and from b and walking backwards along edges, meet at the same node:
EMD(a, b) = Σ_{v,l} P(after l steps a and b meet at v) · l
The expected f-meeting distance:
f-EMD(a, b) = Σ_{v,l} P(after l steps a and b meet at v) · f(l)
Usually f(x) = C^x is chosen with C ∈ (0, 1), since it transforms a distance into a similarity.
SimRank with random walks
Let's define s(a, b) = Σ_{v,l} P(after l steps a and b meet at v) · C^l.
It is easy to show that sim(a, b) is the same as s(a, b).
Corollary: SimRank can be approximated with (backwards) random walks.
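The corollary suggests a Monte Carlo estimator: simulate pairs of backward walks and accumulate C^l whenever the two surfers first meet after l steps. A hedged sketch (the walk length cap and parameter names are illustrative choices):

```python
import random

def mc_simrank(in_neighbors, a, b, C=0.8, walks=20000, max_len=10, seed=0):
    """Estimate s(a,b) = sum over l of P(first meeting after l steps) * C^l
    by simulating pairs of backward random walks on {node: [in-neighbors]}."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        x, y = a, b
        for l in range(1, max_len + 1):
            if not in_neighbors[x] or not in_neighbors[y]:
                break  # a surfer is stuck: no meeting on this run
            x = rng.choice(in_neighbors[x])
            y = rng.choice(in_neighbors[y])
            if x == y:
                total += C ** l  # met after l steps, discounted by C^l
                break
    return total / walks

g = {"x": [], "a": ["x"], "b": ["x"]}
est = mc_simrank(g, "a", "b")
```

On this deterministic toy graph both surfers step to x immediately, so the estimate equals C = 0.8 exactly.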
SimRank for multilingual Wikipedia
Random walk step:
1. Decide whether to continue the walk on a normal edge (with probability α) or on an interwiki link (with probability 1 − α).
2. Select an edge of the chosen type uniformly at random.
This is equivalent to generating the random walk on an edge-weighted graph.
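One step of this walk might be sketched as follows (the fallback behaviour when a node lacks edges of the chosen type is our assumption, not specified on the slides):

```python
import random

def step(node, normal_out, interwiki_out, alpha=0.6, rng=random):
    """One step of the multilingual walk: with probability alpha follow a
    normal (link/category) edge, with probability 1-alpha an interwiki edge,
    then pick uniformly among edges of the chosen type. If only one type
    exists at this node, that type is used; with no edges at all, stop."""
    normal = normal_out.get(node, [])
    inter = interwiki_out.get(node, [])
    if normal and inter:
        use_normal = rng.random() < alpha
    else:
        use_normal = bool(normal)
    if use_normal:
        return rng.choice(normal)
    return rng.choice(inter) if inter else None
```

Iterating `step` yields exactly the backward (or forward) walks needed for the Monte Carlo SimRank estimate.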
SimRank for edge-weighted graphs
Let's start a walk from G with α = 0.6
We choose according to the following probabilities. Let's go to D!
Standing in D, we have the following opportunities.
Category recommendation for an article
Given the German and French Wikipedia, we want to find a new category for article A_2:
1 Take B_1, the equivalent article in German
2 Take the categories of B_1, but discard the trivial ones (K_1's equivalent is already a category of A_2; K_4 doesn't have a pair in French)
3 The candidates are their French equivalents: C_1, C_3
4 Rank the candidates
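Steps 1-3 above can be sketched as a candidate-generation routine (the dictionary-based graph representation is an assumption for illustration):

```python
def category_candidates(article, interwiki, categories_of):
    """Candidate categories for `article`.
    `interwiki` maps a page to its equivalent in the other language (if any);
    `categories_of` maps a page to its set of categories."""
    twin = interwiki.get(article)              # step 1: equivalent article
    if twin is None:
        return set()
    current = categories_of.get(article, set())
    candidates = set()
    for k in categories_of.get(twin, set()):   # step 2: its categories
        c = interwiki.get(k)                   # map each back across languages
        if c is not None and c not in current: # discard trivial ones
            candidates.add(c)
    return candidates                          # these are then ranked (step 4)

# The slide's example: B_1 has categories K_1..K_4; K_1 maps to a category
# A_2 already has, K_4 has no French pair, so C_1 and C_3 remain.
interwiki = {"A2": "B1", "K1": "C2", "K2": "C1", "K3": "C3"}
categories_of = {"A2": {"C2"}, "B1": {"K1", "K2", "K3", "K4"}}
cands = category_candidates("A2", interwiki, categories_of)
```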
Ranking methods
Weighted Jaccard (details skipped here)
SimRank
Novelty: Nov(x) = 1 − SimRank(c_1, ..., c_n, x), where x is a candidate category for article a and c_1, ..., c_n are the current categories of a
Similarity of several nodes: s(v_1, ..., v_k) = (C / (|I(v_1)| ··· |I(v_k)|)) · Σ_{u_1 ∈ I(v_1)} ··· Σ_{u_k ∈ I(v_k)} s(u_1, ..., u_k), where I(v) denotes the in-neighbors of v
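Since the multi-node similarity unrolls to the same random-walk form as pairwise SimRank, a Monte Carlo sketch of the novelty score is possible: k backward walkers contribute C^l when they all first meet after l steps. This simultaneous-meeting reading of the recursion is our interpretation, and all names are illustrative:

```python
import random

def multi_simrank(in_neighbors, nodes, C=0.8, walks=5000, max_len=10, seed=0):
    """Monte Carlo estimate of the multi-node SimRank s(v_1, ..., v_k):
    k backward walkers; a run contributes C^l when they all first meet."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        pos = list(nodes)
        for l in range(1, max_len + 1):
            if any(not in_neighbors[p] for p in pos):
                break  # some walker is stuck
            pos = [rng.choice(in_neighbors[p]) for p in pos]
            if len(set(pos)) == 1:  # all walkers at the same node
                total += C ** l
                break
    return total / walks

def novelty(in_neighbors, current_cats, x, **kw):
    """Nov(x) = 1 - SimRank(c_1, ..., c_n, x)."""
    return 1.0 - multi_simrank(in_neighbors, list(current_cats) + [x], **kw)

g = {"x": [], "a": ["x"], "b": ["x"], "c": ["x"]}
nov = novelty(g, ["a", "b"], "c")
```

High novelty means the candidate is dissimilar to the article's existing categories, i.e. it adds new information.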
Evaluation
In each experiment 10% of the respective edges were deleted (interwiki links: 13 000; categories: 1 914 000; related articles: 8.5 million)
For interwiki links: one ground truth for each input
For categories and related articles: several ground truth instances
Measures of output quality: MRR (mean reciprocal rank), NDCG (a standard IR measure), recall, precision
Manual assessment for type-2 and type-3 tasks
This was joint work with MPII, Saarbrücken (N. Prytkova, M. Spaniol, G. Weikum)
Section 2 Temporal Wikipedia search by edits and linkage
Motivation Wikipedia has the great virtue of being utterly up-to-date A significant event usually has an immediate trace Considering a chain of events, we are often interested in the causes and effects, naturally represented by citations and links. If we want to know how a story evolved in time, we also need the information about the time of appearance of pages and links
Change measure
We measure the change of an article between two dates as the sum of:
the absolute difference of the logarithms of the in-degree at the two dates;
the same for the out-degree;
the absolute difference of the number of words in the article at the two dates.
The change of a node is interesting if the neighborhood of the node has changed as well, e.g. Learning to rank vs. Occupy movement.
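A direct transcription of the change measure (the +1 smoothing inside the logarithms is our assumption, so that degree 0 is defined):

```python
import math

def change(indeg1, outdeg1, words1, indeg2, outdeg2, words2):
    """Change of an article between two snapshots, as defined above:
    log-degree differences plus the absolute word-count difference.
    Degrees are smoothed by +1 (an assumption) so log is defined at 0."""
    return (abs(math.log(indeg2 + 1) - math.log(indeg1 + 1))
            + abs(math.log(outdeg2 + 1) - math.log(outdeg1 + 1))
            + abs(words2 - words1))
```

Using logarithms for degrees damps the effect of already-popular hubs, while word-count changes enter linearly.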
Experiment settings
Goal: given a query Q, find a subgraph consisting of articles relevant to the topic; this graph changes over time, and the change explains the events related to Q.
3 snapshots (September, October and November 2011)
35 queries with ground-truth answers
Evaluation: recall, NDCG, MRR, graph density
Visualizing the graphs with our in-house tool (subjective evaluation)
Selecting the temporally changing subgraph
Main steps of our algorithm:
1 Composing the seed set from the search results
2 Expanding the seed set and taking the induced subgraph
3 Computing the personalization vector
4 Assigning scores to the nodes
5 Selecting the top 15 nodes (and their induced subgraph in each snapshot)
Seed set and expansion
Wikipedia content is indexed by a search engine; the top results usually don't form a connected graph, so we expand this set of nodes to obtain a (possibly) connected component. The new nodes are expected to be relevant or recently edited. Candidates: the 1-step neighborhood of the seed set, scored as
score(v) = max_{u ∈ seed} ( IR(u) + (change(u) + change(v)) / 2 )
and we expand with the highest-scoring nodes.
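The candidate score can be transcribed directly (how the seed neighbors of a candidate are enumerated is an implementation detail we assume):

```python
def expansion_score(v, seed_neighbors_of_v, IR, change):
    """Score of candidate v (a 1-step neighbor of the seed set):
    the best seed neighbor's IR score plus the averaged change of the pair."""
    return max(IR[u] + (change[u] + change[v]) / 2
               for u in seed_neighbors_of_v)

# v neighbors two seed nodes; the relevant-but-stable u1 wins here.
IR = {"u1": 1.0, "u2": 0.5}
ch = {"u1": 0.2, "u2": 1.0, "v": 0.4}
s = expansion_score("v", ["u1", "u2"], IR, ch)
```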
Personalization method
Both the IR score and the change are relevant; before combining them, both values need to be scaled.
α: trade-off between change and relevance
T: depends on the distribution of the change values (T = 10 worked fine)
p(u) = α · IR(u) / max IR + (1 − α) · change(u) / (change(u) + T)
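Assuming the combination p(u) = α · IR(u)/max IR + (1 − α) · change(u)/(change(u) + T) (our reading of the scaling described above), the personalization vector can be computed as:

```python
def personalization(IR, change, alpha=0.5, T=10.0):
    """Per-node personalization weight: IR scaled by the maximum IR score,
    change squashed into [0, 1) by change/(change + T)."""
    max_ir = max(IR.values())
    return {u: alpha * IR[u] / max_ir
               + (1 - alpha) * change[u] / (change[u] + T)
            for u in IR}

# A highly relevant but stable node and a less relevant, fast-changing one
# can end up with comparable personalization weight.
p = personalization({"a": 2.0, "b": 1.0}, {"a": 0.0, "b": 10.0})
```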
Personalized PageRank
Random surfer model: browsing the web, following hyperlinks; sometimes she gets bored and teleports. When teleporting, the surfer jumps according to the distribution d (instead of the uniform distribution).
PPR_d^{(k+1)T} = PPR_d^{(k)T} ((1 − α)M + αD) = PPR_d^{(1)T} ((1 − α)M + αD)^k
where D is the matrix whose every row equals the personalization vector d = (d_1, ..., d_n).
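Since every row of D equals d, the update x ← x((1 − α)M + αD) simplifies to x ← (1 − α)xM + αd whenever x sums to 1. A minimal power-iteration sketch with a dense row-stochastic matrix (pure Python for clarity; Wikipedia-scale graphs would need sparse representations):

```python
def personalized_pagerank(M, d, alpha=0.15, iters=100):
    """Power iteration for PPR: x <- (1-alpha) * x M + alpha * d,
    with M a row-stochastic transition matrix (list of rows) and
    d the personalization (teleport) distribution."""
    n = len(d)
    x = d[:]  # start from the personalization vector
    for _ in range(iters):
        x = [(1 - alpha) * sum(x[i] * M[i][j] for i in range(n))
             + alpha * d[j]
             for j in range(n)]
    return x

# Two nodes linking to each other, all teleport mass on node 0:
M = [[0.0, 1.0], [1.0, 0.0]]
x = personalized_pagerank(M, [1.0, 0.0])
```

At the fixed point x_0 = 0.15 / (1 − 0.85²) ≈ 0.5405: the personalized node keeps the larger share.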
HITS algorithm: hubs and authorities
Idea behind HITS: a good hub is a page that points to many other pages; a good authority is a page that is linked by many different hubs.
â(v) = Σ_{(u,v) ∈ E} w(u, v) h(u),   a = â / ‖â‖
ĥ(v) = Σ_{(v,u) ∈ E} w(v, u) a(u),   h = ĥ / ‖ĥ‖
Personalized HITS
Supersource: a new node connected to each node of the original graph, where the weight of an edge corresponds to the importance of the respective node in the personalization. The supersource distributes a fixed amount of score in each iteration:
â(v) = Σ_{(u,v) ∈ E} w(u, v) h(u) + c · p(v),   a = â / ‖â‖
ĥ(v) = Σ_{(v,u) ∈ E} w(v, u) a(u) + c · p(v),   h = ĥ / ‖ĥ‖
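A sketch of personalized HITS with L1 normalization (the edge-dictionary representation and the normalization choice are our assumptions; setting c = 0 recovers plain weighted HITS):

```python
def personalized_hits(edges, nodes, p, c=0.2, iters=50):
    """Personalized HITS. Authority gathers hub score over incoming edges,
    hub gathers authority over outgoing edges, each plus c * p(v);
    both vectors are L1-normalized every iteration.
    edges: {(u, v): weight}, p: personalization per node."""
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(iters):
        na = {v: sum(w * h[u] for (u, t), w in edges.items() if t == v)
                 + c * p[v] for v in nodes}
        nh = {v: sum(w * a[t] for (s, t), w in edges.items() if s == v)
                 + c * p[v] for v in nodes}
        sa, sh = sum(na.values()), sum(nh.values())
        a = {v: na[v] / sa for v in nodes}
        h = {v: nh[v] / sh for v in nodes}
    return a, h

# One edge s -> t with no personalization: s is the pure hub, t the authority.
a, h = personalized_hits({("s", "t"): 1.0}, ["s", "t"],
                         {"s": 0.0, "t": 0.0}, c=0.0, iters=5)
```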
Node scoring
Multiple versions: scores from hubs; scores from authorities; a combination of the hub and authority vectors.
Other combinations are possible as well: hubs & PageRank; authorities & PageRank; hubs & authorities & PageRank.
Results
(figures: NDCG values; growth of the number of edges)
An example of a changing subgraph
Result for the query Greek economy (figure)
Thank you for your attention!