Cross-lingual and temporal Wikipedia analysis

Similar documents
CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

On Top-k Structural. Similarity Search. Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada

To Randomize or Not To

ECEN 689 Special Topics in Data Science for Communications Networks

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Link Analysis Ranking

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

CS47300: Web Information Search and Management

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture: Local Spectral Methods (2 of 4) 19 Computing spectral ranking with the push procedure

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Link Analysis. Leonid E. Zhukov

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Slides based on those in:

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Heat Kernel Based Community Detection

Lecture 12: Link Analysis for Web Retrieval

1998: enter Link Analysis

Online Social Networks and Media. Link Analysis and Web Search

Idea: Select and rank nodes w.r.t. their relevance or interestingness in large networks.

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Link Analysis. Stony Brook University CSE545, Fall 2016

Data Mining Recitation Notes Week 3

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

Data Mining and Analysis: Fundamental Concepts and Algorithms

0.1 Naive formulation of PageRank

Complex Social System, Elections. Introduction to Network Analysis 1

Data Mining and Matrices

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

An Introduction to Machine Learning

A Note on Google s PageRank

Online Social Networks and Media. Link Analysis and Web Search

Google Page Rank Project Linear Algebra Summer 2012

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Link Mining PageRank. From Stanford C246

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

Introduction to Data Mining

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

Data and Algorithms of the Web

Personal PageRank and Spilling Paint

CS6220: DATA MINING TECHNIQUES

Cutting Graphs, Personal PageRank and Spilling Paint

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee

CS6220: DATA MINING TECHNIQUES

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Computing PageRank using Power Extrapolation

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

Lecture: Local Spectral Methods (1 of 4)

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties

Web Structure Mining Nodes, Links and Influence

The Static Absorbing Model for the Web a

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Combating Web Spam with TrustRank

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Jeffrey D. Ullman Stanford University

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Internal link prediction: a new approach for predicting links in bipartite graphs

Statistical Problem. . We may have an underlying evolving system. (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t

EFFICIENT ALGORITHMS FOR PERSONALIZED PAGERANK

Recommendation Systems

Intelligent Systems (AI-2)

Model Reduction for Edge-Weighted Personalized PageRank

STA141C: Big Data & High Performance Statistical Computing

Edge-Weighted Personalized PageRank: Breaking a Decade-Old Performance Barrier

Intelligent Systems (AI-2)

Network Alignment 858L

Node Centrality and Ranking on Networks

Faloutsos, Tong ICDE, 2009

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Recommendation Systems

Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model

Updating PageRank. Amy Langville Carl Meyer

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Algorithms, Graph Theory, and the Solu7on of Laplacian Linear Equa7ons. Daniel A. Spielman Yale University

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Uncertainty and Randomization

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Lecture 7 Mathematics behind Internet Search

IR: Information Retrieval

(K2) For sum of the potential differences around any cycle is equal to zero. r uv resistance of {u, v}. [In other words, V ir.]

PV211: Introduction to Information Retrieval

Degree Distribution: The case of Citation Networks

Variable Latent Semantic Indexing

Efficient Subgraph Matching by Postponing Cartesian Products. Matthias Barkowsky

Locally-biased analytics

Name: Matriculation Number: Tutorial Group: A B C D E

What is this Page Known for? Computing Web Page Reputations. Outline

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Introduction to Information Retrieval

Transcription:

MTA SZTAKI Data Mining and Search Group June 14, 2013 Supported by the EC FET Open project New tools and algorithms for directed network analysis (NADINE No 288956)

Table of Contents 1 Link prediction on multilingual Wikipedia Motivation About SimRank Simrank for multilingual Wikipedia Link prediction 2 Temporal Wikipedia search by edits and linkage Motivation Selecting temporal changing subgraph Personalized PageRank and Personalized HITS

Section 1 Link prediction on multilingual Wikipedia

Multilingual Wikipedia Wikipedia articles about Erdős-number in German, French and Hungarian

Multilingual Graph model Edge types: links between articles category-contains-article relationship category-hierarchy-links interwiki links (between languages)

Statistics 3 languages: German, French, Hungarian snapshot from March 2012 lang. articles categories De 2 338 795 139 844 Fr 2 408 097 199 708 Hu 339 041 34 653 Parallel articles Parallel categories De-Fr 482 196 De-Hu 108 949 Fr-Hu 119 559 De-Fr 22 175 De-Hu 4 840 Fr-Hu 5 387 Only a small fraction of pages has an equivalent version Category hierarchies are entirely different

Applications, Use cases Motivation: cleansing, expanding local Wikipedia: new content from a bigger Wikipedia to a smaller more detailed content from a smaller, better specified Wikipedia to the bigger one Tag recommendation in similarly structured networks (LibraryThing, Amazon)

Link prediction We were focusing on: interwiki link recommendation for categories category recommendation for articles related entity recommendation for articles Similar methods are used: 1 Setting candidates 2 Ranking candidates (with Jaccard, SimRank, etc.)

Basic SimRank Equation Two pages are similar if pointed to by similar pages (Jeh Widom KDD 2002) The similarity between objects a and b: sim(a, b) [0, 1] sim(a, b) = C N(a) N(b) N(a) i=1 N(b) j=1 sim(n i (a), N j (b)) 1 if a = b otherwise Similarity between a and b is the average similarity between in-neighbors of a and in-neighbors of b C is called decay factor, it is a constant between 0 and 1

Simrank with random walks Expected meeting distance is the expected time of how soon two random surfers (starting from a and from b) meet at the same node, walking backwards on edges. EMD(a, b) = v,l P(after l steps a and b meet at v) l Expected f -meeting distance f EMD(a, b) = v,l P(after l steps a and b meet at v) f (l) Usually f (x) = C x is choosen with C (0, 1), since it transformes distance to similarity.

SimRank with random walks Let s define s(a, b) = v,l P(after l steps a and b meet at v) C l It is easy to show that sim(a, b) is the same as s(a, b) Corollary: SimRank can be approximated with (backwards) random walks.

Simrank for multilingual Wikipedia Random walk: 1. decide, whether we continue the walk on a normal edge (with α probability) or on an interwiki link (with 1 α probability). 2. select uniformly an edge with the type determined above Equivalent: generating random walk on an edge-weighted graph

SimRank for edge-weighted graphs Let s start a walk from G with α = 0.6

SimRank for edge-weighted graphs We choose according to the following probabilities. Let s go to D!

SimRank for edge-weighted graphs Standing in D we have the following oportunities.

Category recommendation for an article Given German and French Wikipedia and we want to find a new category for article A 2

Category recommendation for an article Given German and French Wikipedia and we want to find a new category for article A 2 1 Take B 1, the equivalent article in German

Category recommendation for an article Given German and French Wikipedia and we want to find a new category for article A 2 1 Take B 1, the equivalent article in German 2 Take the categories of B 1 but discard trivial ones (K 1 s equivalent is already the category of A 2, K 4 doesn t have a pair in French)

Category recommendation for an article Given German and French Wikipedia and we want to find a new category for article A 2 1 Take B 1, the equivalent article in German 2 Take the categories of B 1 but discard trivial ones (K 1 s equivalent is already the category of A 2, K 4 doesn t have a pair in French) 3 The candidates are their French equivalents: C 1, C 3

Category recommendation for an article Given German and French Wikipedia and we want to find a new category for article A 2 1 Take B 1, the equivalent article in German 2 Take the categories of B 1 but discard trivial ones (K 1 s equivalent is already the category of A 2, K 4 doesn t have a pair in French) 3 The candidates are their French equivalents: C 1, C 3 4 Rank candidates

Ranking methods Weighted Jaccard (details were skipped here) SimRank Novelty: Nov(x) = 1 SimRank(c 1,..., c n, x) where x is a candidate category for article a, and the current categories of a are c 1,..., c n Similarity of several nodes: s(v 1,..., v k ) = C I (v 1 ) I (v k ) u 1 I (v 1 )... u k I (v k ) s(u 1,..., u k )

Evaluation In each experiment 10 % of the respective edges were deleted (Interwiki links: 13000, Categories: 1 914 000, related articles: 8.5 Mill. ) For interwiki links: one ground truth for each input For categories and related articles: several ground truth instances Measures for the output quality: MRR (mean reciprocial rank) ndcg (standard measure for IR - problems) Recall Precision Manual assessment for type-2 and type-3 This was a joint work with MPII, Saarbrücken (N. Prytkova, M.Spaniol, G.Weikum)

Section 2 Temporal Wikipedia search by edits and linkage

Motivation Wikipedia has the great virtue of being utterly up-to-date A significant event usually has an immediate trace Considering a chain of events, we are often interested in the causes and effects, naturally represented by citations and links. If we want to know how a story evolved in time, we also need the information about the time of appearance of pages and links

Change measure We measure change as the sum of Difference between the logarithm of the in-degree between the two dates; Same for out-degree; Absolute difference between the number of words in the article between the two dates.

Change measure We measure change as the sum of Difference between the logarithm of the in-degree between the two dates; Same for out-degree; Absolute difference between the number of words in the article between the two dates. The change of a node is interesting, if the neighborhood of the node has changed as well E.g. Learning to rank vs. Occupy movement

Experiment settings Goal: Given a query Q and we want to find a subgraph that consists of relevant articles respective to the topic, this graph changes with time and this change explains the considered events related to Q. 3 snapshots (2011. september, october, november) 35 queries with ground truth answers Evaluation: recall, NDCG, MRR graph density Visualizing the graphs with our in-house built tool (subjective evaluation)

Selecting temporal changing subgraph Main steps of our algorithm: 1 Composing the seed set from the search results 2 Expanding of seed set and consider the induced subgraph 3 Computing personalization vector 4 Assinging scores to the nodes 5 Selecting the top-15 (and their induced subgraph in each snapshot)

Seed set and expansion Wikipedia content is indexed by a search engine top results usually don t form a connected graph we expand this set of nodes in order to get a possibly connected component new nodes are expected to be relevant or recently edited candidates: 1-step neighborhood of the seed set ( score(v) = IR(u) + max u seed expand with the nodes with highest rank change(u) + change(v) 2 )

Personalization method both IR score and change are relevant: before combination both value need to be scaled α: trade-off between change and relevance T : depends on the distribution of change values (T = 10 worked fine) p(u) = α IR(u) change(u) + (1 α) maxir (change(u) + T )

Personalized PageRank Random surfer model Browsing the web, following hyperlinks Sometimes she gets bored and teleports When teleporting, take distribution d (instead of uniform distribution) PPR (k+1)t d = PPR (k)t d ((1 α)m + α D) = PPR (1)T d ((1 α)m + α D) k where D has all rows equal to the personalization vector d = (d 1,..., d n ) d d D =... d

HITS algorithm - Hubs and Authorities Idea behind HITS: A good hub is a page that pointed to many other pages, A good authority is a page that was linked by many different hubs â(v) = w(vu) h(u), a = â/ a vu E ĥ(v) = w(uv) a(u), uv E h = ĥ/ h

Personalized HITS supersource: a new node of the graph which is connected with each node of the original graph, and the weight of an edge corresponds to the importance of the respective node in the personalization The supersource distributes a fixed amount of score in each iteration â(v) = w(vu) h(u) + c p(v), vu E ĥ(v) = w(uv) a(u) + c p(v), uv E a = â/ a h = ĥ/ h

Node scoring Multiple versions: Scores from Hubs Scores from Authorities combination of Hub and Authority vector Other combinations are possible as well: Hubs & PageRank Authority & Pagerank Hubs & Authority & Pagerank

Results ndcg values Growth of number of edges

An example for changing subgraph Result for Greek economy

An example for changing subgraph Result for Greek economy

An example for changing subgraph Result for Greek economy

Thank you for your attention!