CS249: ADVANCED DATA MINING


Graph and Network
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
May 31, 2017

Methods to Learn
- Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; NN (vector data) | Label Propagation (graph & network)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means (vector data) | PLSA; LDA (text data) | Matrix Factorization (recommender systems) | SCAN; Spectral Clustering (graph & network)
- Prediction: Linear Regression; GLM (vector data) | Collaborative Filtering (recommender systems)
- Ranking: PageRank (graph & network)
- Feature Representation: Word embedding (text data) | Network embedding (graph & network)

Mining Graph/Network Data (outline)
- Introduction to Graph/Network Data
- PageRank
- Classification via Label Propagation
- Spectral Clustering
- Summary

Graph, Graph, Everywhere (from H. Jeong et al., Nature 411, 41 (2001)). Examples pictured: the aspirin molecule, a yeast protein interaction network, the Internet, and a co-author network.

Why Graph Mining?
- Graphs are ubiquitous: chemical compounds (Cheminformatics); protein structures and biological pathways/networks (Bioinformatics); program control flow, traffic flow, and workflow analysis; XML databases, the Web, and social network analysis
- The graph is a general model: trees, lattices, sequences, and items are degenerate graphs
- Diversity of graphs: directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
- Complexity of algorithms: many graph problems are of high complexity

Representation of a Graph
- G = <V, E>, where V = {u_1, ..., u_n} is the node set and E ⊆ V × V is the edge set
- Adjacency matrix A = (a_ij), i, j = 1, ..., N, with a_ij = 1 if <u_i, u_j> ∈ E and a_ij = 0 if <u_i, u_j> ∉ E
- Undirected graph vs. directed graph: A = A^T vs. A ≠ A^T
- Weighted graph: use W instead of A, where w_ij represents the weight of edge <u_i, u_j>

Example: a three-page web graph (Yahoo, Amazon, M'soft) and its adjacency matrix A:

      y  a  m
  y   1  1  0
  a   1  0  1
  m   0  1  0
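To make the representation concrete, here is a minimal sketch in Python/NumPy of the adjacency matrix above (the node ordering y, a, m follows the slide; the variable names are illustrative):

```python
import numpy as np

# Adjacency matrix A for the three-page example: a_ij = 1 iff page i links to page j.
nodes = ["y", "a", "m"]          # Yahoo, Amazon, M'soft
A = np.array([
    [1, 1, 0],                   # y links to y and a
    [1, 0, 1],                   # a links to y and m
    [0, 1, 0],                   # m links to a
])
```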

Mining Graph/Network Data: PageRank

The History of PageRank. PageRank was developed by Larry Page (hence the name PageRank) and Sergey Brin. It began as part of a research project about a new kind of search engine; the project started in 1995 and led to a functional prototype in 1998. Shortly after, Page and Brin founded Google.

Ranking web pages
- Web pages are not equally important: www.cnn.com vs. a personal webpage
- Inlinks as votes: the more inlinks, the more important the page
- Are all inlinks equal? A higher-ranked inlink should play a more important role
- A recursive question!

Simple recursive formulation
- Each link's vote is proportional to the importance of its source page
- If page P with importance x has n outlinks, each link gets x/n votes
- Page P's own importance is the sum of the votes on its inlinks
(Figure: the three-page graph with each link labeled by the share of importance it carries.)

Matrix formulation
- Matrix M has one row and one column for each web page
- Suppose page j has n outlinks: if j → i, then M_ij = 1/n; else M_ij = 0
- M is a column-stochastic matrix: columns sum to 1
- Suppose r is a vector with one entry per web page; r_i is the importance score of page i
- Call r the rank vector; |r| = 1 (i.e., r_1 + r_2 + ... + r_N = 1)
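Continuing the NumPy sketch from the introduction (with A as defined there), the column-stochastic M can be built by normalizing each page's outlinks; this sketch assumes every page has at least one outlink:

```python
# M[i, j] = 1/outdeg(j) if j links to i, else 0.
out_deg = A.sum(axis=1)                 # out-degree of each page (row sums of A)
M = (A / out_deg[:, None]).T            # divide each row by its degree, then transpose
assert np.allclose(M.sum(axis=0), 1.0)  # column-stochastic: every column sums to 1
```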

Eigenvector formulation. The flow equations can be written r = Mr, so the rank vector r is an eigenvector of the stochastic web matrix M. In fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1.

Example. For the Yahoo/Amazon/M'soft graph, the column-stochastic matrix and flow equations are:

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    1
  m    0   1/2   0

  y = y/2 + a/2
  a = y/2 + m
  m = a/2

In matrix form, r = M r.

Power Iteration method
- Simple iterative scheme; suppose there are N web pages
- Initialize: r_0 = [1/N, ..., 1/N]^T
- Iterate: r_{k+1} = M r_k
- Stop when |r_{k+1} - r_k|_1 < ε, where |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm (any other vector norm, e.g., Euclidean, can also be used)

Power Iteration Example. For the same M:

  r_0 = (1/3, 1/3, 1/3)
  r_1 = (1/3, 1/2, 1/6)
  r_2 = (5/12, 1/3, 1/4)
  r_3 = (3/8, 11/24, 1/6)
  ...
  r → (2/5, 2/5, 1/5)
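A short power-iteration sketch that reproduces these numbers (the tolerance and iteration cap are illustrative choices, not from the slides):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])           # column-stochastic web matrix from the example

N = M.shape[0]
r = np.full(N, 1.0 / N)                   # r_0 = [1/3, 1/3, 1/3]
for _ in range(200):
    r_next = M @ r                        # r_{k+1} = M r_k
    if np.abs(r_next - r).sum() < 1e-12:  # stop when the L1 change is below epsilon
        break
    r = r_next
print(r_next)                             # -> approximately [0.4, 0.4, 0.2] = (2/5, 2/5, 1/5)
```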

Random Walk Interpretation
- Imagine a random web surfer: at any time t, the surfer is on some page P
- At time t+1, the surfer follows an outlink from P uniformly at random, ending up on some page Q linked from P
- The process repeats indefinitely
- Let p(t) be the vector whose i-th component is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages

The stationary distribution
- Where is the surfer at time t+1? Following a link uniformly at random gives p(t+1) = M p(t)
- Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t); then p(t) is called a stationary distribution for the random walk
- Our rank vector r satisfies r = Mr, so it is a stationary distribution for the random surfer

Existence and Uniqueness. A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions (irreducibility and aperiodicity), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0.

Spider traps. A group of pages is a spider trap if there are no links from within the group to outside the group: the random surfer gets trapped. Spider traps violate the conditions needed for the random-walk theorem.

Microsoft becomes a spider trap. M'soft now links only to itself:

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    0
  m    0   1/2   1

Power iteration: r_0 = (1/3, 1/3, 1/3), r_1 = (1/3, 1/6, 1/2), r_2 = (1/4, 1/6, 7/12), r_3 = (5/24, 1/8, 2/3), ... → (0, 0, 1).

Random teleports: the Google solution for spider traps. At each time step, the random surfer has two options:
- With probability β, follow a link at random
- With probability 1-β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

Random teleports (β = 0.8). With probability 0.2 the surfer teleports to one of the 3 pages (probability 0.2 · 1/3 each); with probability 0.8 they follow a real outlink:

          1/2 1/2 0           1/3 1/3 1/3       y   7/15 7/15  1/15
  0.8 ·   1/2  0  0   + 0.2 · 1/3 1/3 1/3   =   a   7/15 1/15  1/15
           0  1/2 1           1/3 1/3 1/3       m   1/15 7/15 13/15

The rank vector now solves r = A r for this combined matrix A.

Matrix formulation. Suppose there are N pages. Consider a page j with set of outlinks O(j); we have M_ij = 1/|O(j)| when j → i, and M_ij = 0 otherwise. The random teleport is equivalent to:
- adding a teleport link from j to every other page with probability (1-β)/N
- reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
- equivalently: taxing each page a fraction (1-β) of its score and redistributing it evenly

PageRank. Construct the N-by-N matrix A as follows:

  A_ij = β M_ij + (1-β)/N

Verify that A is a stochastic matrix. The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar. Equivalently, r is the stationary distribution of the random walk with teleports.
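A sketch of this construction on the spider-trap example above (β = 0.8; the dense N × N matrix is only sensible at toy scale):

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])                   # M'soft is a spider trap
N = M.shape[0]

A = beta * M + (1 - beta) / N * np.ones((N, N))   # A_ij = beta*M_ij + (1-beta)/N
assert np.allclose(A.sum(axis=0), 1.0)            # A is column-stochastic

r = np.full(N, 1.0 / N)
for _ in range(200):
    r = A @ r
print(r)   # -> about (7/33, 5/33, 21/33): the trap no longer absorbs all the score
```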

Dead ends. Pages with no outlinks are dead ends for the random surfer: nowhere to go on the next step.

Microsoft becomes a dead end. M'soft now has no outlinks, so column m of M is all zeros:

          1/2 1/2 0           1/3 1/3 1/3       y   7/15 7/15 1/15
  0.8 ·   1/2  0  0   + 0.2 · 1/3 1/3 1/3   =   a   7/15 1/15 1/15
           0  1/2 0           1/3 1/3 1/3       m   1/15 7/15 1/15

The m column of the result sums to 1/5, not 1: the matrix is nonstochastic! Power iteration leaks probability mass: r_0 = (1/3, 1/3, 1/3), r_1 = (1/3, 1/5, 1/5), ... → (0, 0, 0).

Dealing with dead ends
- Teleport: follow random teleport links with probability 1.0 from dead ends; adjust the matrix accordingly
- Prune and propagate: preprocess the graph to eliminate dead ends (might require multiple passes), compute PageRank on the reduced graph, then approximate values for dead ends by propagating values from the reduced graph

Dealing with dead ends: teleport. From the dead end m, teleport with probability 1 (i.e., the teleport column for m gets weight 1 · 1/3 instead of 0.2 · 1/3):

          1/2 1/2 0       0.2·1/3 0.2·1/3 1·1/3       y   7/15 7/15 1/3
  0.8 ·   1/2  0  0   +   0.2·1/3 0.2·1/3 1·1/3   =   a   7/15 1/15 1/3
           0  1/2 0       0.2·1/3 0.2·1/3 1·1/3       m   1/15 7/15 1/3
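In code, the teleport fix amounts to replacing each all-zero column of M with 1/N before mixing in the teleports; a sketch under the same toy setup as above:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])   # column m is all zeros: a dead end
N = M.shape[0]

dead = M.sum(axis=0) == 0         # detect dead-end columns
M[:, dead] = 1.0 / N              # teleport with probability 1.0 from dead ends
A = 0.8 * M + 0.2 / N             # standard teleport mix; A is stochastic again
```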

Dealing with dead ends: reduce the graph. (Figures: Ex. 1 prunes the dead end M'soft, leaving the Yahoo-Amazon subgraph; Ex. 2 prunes dead ends recursively over several passes before computing PageRank on the reduced graph.)

Computing PageRank
- The key step is the matrix-vector multiplication r_new = A r_old
- Easy if we have enough main memory to hold A, r_old, and r_new
- Say N = 1 billion pages and 4 bytes per entry: the two vectors alone have 2 billion entries, approx. 8 GB
- But matrix A has N^2 = 10^18 entries, a very large number!

Rearranging the equation

  r = Ar, where A_ij = β M_ij + (1-β)/N
  r_i = Σ_{1≤j≤N} A_ij r_j
      = Σ_{1≤j≤N} [β M_ij + (1-β)/N] r_j
      = β Σ_{1≤j≤N} M_ij r_j + (1-β)/N · Σ_{1≤j≤N} r_j
      = β Σ_{1≤j≤N} M_ij r_j + (1-β)/N, since |r| = 1

So r = βMr + [(1-β)/N]_N, where [x]_N is an N-vector with all entries x.

Sparse matrix formulation. We can rearrange the PageRank equation as r = βMr + [(1-β)/N]_N, where [(1-β)/N]_N is an N-vector with all entries (1-β)/N. M is a sparse matrix (say 10 links per node, approx. 10N entries), so in each iteration we need to:
- compute r_new = βM r_old
- add the constant value (1-β)/N to each entry in r_new

Sparse matrix encoding. Encode the sparse matrix using only its nonzero entries; space is roughly proportional to the number of links (say 10N, or 4*10*1 billion = 40 GB: still won't fit in memory, but will fit on disk).

  source node | degree | destination nodes
  0           | 3      | 1, 5, 7
  1           | 5      | 17, 64, 113, 117, 245
  2           | 2      | 13, 23
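A sketch of one update pass over such an encoding (here an in-memory dict of adjacency lists stands in for the on-disk table; the link data is made up for illustration):

```python
import numpy as np

beta = 0.8
N = 8                                         # toy page count
links = {0: [1, 5, 7], 1: [3, 4], 2: [0, 3]}  # source -> destination nodes (illustrative)

r_old = np.full(N, 1.0 / N)
r_new = np.full(N, (1 - beta) / N)            # start each entry at the teleport share
for src, dests in links.items():              # sequential scan of the link table
    share = beta * r_old[src] / len(dests)    # beta * M_ij * r_old[j], since M_ij = 1/deg(j)
    for d in dests:
        r_new[d] += share
# Note: pages with no outlinks (dead ends) leak mass here; apply the fixes above first.
```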

Basic Algorithm. Assume we have enough RAM to fit r_new plus some working memory; store r_old and matrix M on disk.
- Initialize: r_old = [1/N]_N
- Iterate: perform a sequential scan of M and r_old to update r_new, then write r_new out to disk as r_old for the next iteration
- Every few iterations, compute |r_new - r_old| and stop if it is below a threshold (this needs both vectors read into memory)

Mining Graph/Network Data: Classification via Label Propagation

Label Propagation in the Network. Given a network where some nodes have labels, can we classify the unlabeled nodes by using the link information? E.g., node 12 belongs to Class 1 and node 5 belongs to Class 2.

Reference: Learning from Labeled and Unlabeled Data with Label Propagation, by Xiaojin Zhu and Zoubin Ghahramani. http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf

Problem Formalization
- Given n nodes: l with labels (Y_1, Y_2, ..., Y_l are known) and u without labels (Y_{l+1}, Y_{l+2}, ..., Y_n are unknown)
- Y is the n × C label matrix, where C is the number of labels (classes)
- The adjacency matrix is W
- The probabilistic transition matrix T is T_ij = P(j → i) = w_ij / Σ_k w_kj

The Label Propagation Algorithm
- Step 1: Propagate Y ← TY, i.e., Y_i = Σ_j T_ij Y_j = Σ_j P(j → i) Y_j
- Step 2: Row-normalize Y, so that each object's probabilities of belonging to the classes sum to 1
- Step 3: Reset the labels for the labeled nodes
- Repeat Steps 1-3 until Y converges
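A minimal sketch of these three steps (the inputs W, Y0, and the labeled index set are assumptions of the example; a small epsilon guards all-zero rows early on):

```python
import numpy as np

def label_propagation(W, Y0, labeled, max_iter=1000, tol=1e-6):
    """W: n x n weight matrix; Y0: n x C, one-hot rows for labeled nodes, zeros elsewhere."""
    T = W / W.sum(axis=0, keepdims=True)           # T_ij = P(j -> i) = w_ij / sum_k w_kj
    Y = Y0.astype(float).copy()
    for _ in range(max_iter):
        Y_prev = Y.copy()
        Y = T @ Y                                  # Step 1: propagate, Y <- T Y
        Y /= Y.sum(axis=1, keepdims=True) + 1e-12  # Step 2: row-normalize
        Y[labeled] = Y0[labeled]                   # Step 3: reset labels of labeled nodes
        if np.abs(Y - Y_prev).max() < tol:         # repeat until Y converges
            break
    return Y
```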

Example: Iter = 0. (Figure: only node 12, Class 1, and node 5, Class 2, carry labels.)

Example: Iter = 1. (Figure: the labels begin to propagate to the neighbors of nodes 12 and 5.)

Example: Iter = 2. (Figure: labels continue to spread.) For node 6:

  Y_6 = P(7→6) Y_7 + P(11→6) Y_11 + P(10→6) Y_10 + P(3→6) Y_3 + P(4→6) Y_4 + P(0→6) Y_0
      = (3/4, 0) + (0, 3/4) = (3/4, 3/4)

After normalization, Y_6 = (1/2, 1/2).

Repeat until convergence.

Mining Graph/Network Data: Spectral Clustering

Clustering Graph and Network Data: Applications
- Bipartite graphs, e.g., customers and products, authors and conferences
- Web search engines, e.g., click-through graphs and Web graphs
- Social networks, friendship/co-author graphs
(Figure: clustering books about politics [Newman, 2006].)

Spectral Clustering. Reference: ICDM'09 tutorial by Chris Ding. Example: clustering Supreme Court justices according to their voting behavior. (Figures: the justices' vote-similarity matrix W, and the clusters it yields.)

Spectral Graph Partition: Min-Cut. Minimize the number of edges cut between the two parts.

Objective Function

Algorithm
- Step 1: Calculate the graph Laplacian matrix L = D - W (where D is the diagonal degree matrix)
- Step 2: Calculate the second eigenvector q of L
- Step 3: Bisect q (e.g., at 0) to get two clusters
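A sketch of these three steps in NumPy (assumes a symmetric weight matrix W; `eigh` returns eigenvectors sorted by ascending eigenvalue, so column 1 is the second eigenvector). Bisecting at 0 follows the slide; other thresholds, such as the median of q, are also common:

```python
import numpy as np

def spectral_bisect(W):
    """Two-way partition from the second eigenvector of the Laplacian L = D - W."""
    D = np.diag(W.sum(axis=1))        # Step 1: degree matrix...
    L = D - W                         # ...and graph Laplacian
    _, eigvecs = np.linalg.eigh(L)    # Step 2: eigh assumes a symmetric matrix
    q = eigvecs[:, 1]                 # second eigenvector (the Fiedler vector)
    return q >= 0                     # Step 3: bisect q at 0 -> boolean cluster labels
```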

*Minimum Cut with Constraints

*New Objective Functions

Other References: A Tutorial on Spectral Clustering, by U. von Luxburg. http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/luxburg07_tutorial_4488%5b0%5d.pdf

Mining Graph/Network Data: Summary

Summary
- Ranking on graph/network: PageRank
- Classification: label propagation
- Clustering: spectral clustering

Announcements
- Presentations next week: Teams 1-4 Monday, Teams 5-8 Wednesday
- Project report, code, and data due 6/12
- Teaching evaluation form