Large Graph Mining: Power Tools and a Practitioner s guide

Similar documents
Faloutsos, Kolda, Sun

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos

Higher-Order Web Link Analysis Using Multilinear Algebra

Faloutsos, Tong ICDE, 2009

CMU SCS Talk 3: Graph Mining Tools Tensors, communities, parallelism

Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams

Large Graph Mining: Power Tools and a Practitioner s guide

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Tensor Analysis. Topics in Data Mining Fall Bruno Ribeiro

Fitting a Tensor Decomposition is a Nonlinear Optimization Problem

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Two heads better than one: pattern discovery in time-evolving multi-aspect data

Link Analysis Ranking

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Two heads better than one: Pattern Discovery in Time-evolving Multi-Aspect Data

Talk 2: Graph Mining Tools - SVD, ranking, proximity. Christos Faloutsos CMU

Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach

High Performance Parallel Tucker Decomposition of Sparse Tensors

Postgraduate Course Signal Processing for Big Data (MSc)

CS6220: DATA MINING TECHNIQUES

CMU SCS. Large Graph Mining. Christos Faloutsos CMU

Online Social Networks and Media. Link Analysis and Web Search

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

CS246 Final Exam, Winter 2011

CS6220: DATA MINING TECHNIQUES

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY

Lecture 12: Link Analysis for Web Retrieval

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

Parallel Tensor Compression for Large-Scale Scientific Data

STA141C: Big Data & High Performance Statistical Computing

Bruce Hendrickson Discrete Algorithms & Math Dept. Sandia National Labs Albuquerque, New Mexico Also, CS Department, UNM

Slides based on those in:

MultiVis: Content-based Social Network Exploration Through Multi-way Visual Analysis

Online Social Networks and Media. Link Analysis and Web Search

A Three-Way Model for Collective Learning on Multi-Relational Data

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

What is this Page Known for? Computing Web Page Reputations. Outline

MultiRank: Co-Ranking for Objects and Relations in Multi-Relational Data

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

ON THE NP-COMPLETENESS OF SOME GRAPH CLUSTER MEASURES

Scalable Tensor Factorizations with Incomplete Data

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

CS249: ADVANCED DATA MINING

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Data Mining and Matrices

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

Cross-lingual and temporal Wikipedia analysis

Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities

Kronecker Product Approximation with Multiple Factor Matrices via the Tensor Product Algorithm

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CVPR A New Tensor Algebra - Tutorial. July 26, 2017

BEYOND ASSORTATIVITY PROCLIVITY INDEX FOR ATTRIBUTED NETWORKS (PRONE) Reihaneh Rabbany Dhivya Eswaran*

Tensor Decompositions and Applications

HAR: Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity

Multi-Way Compressed Sensing for Big Tensor Data

1 Searching the World Wide Web

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization

Data Mining and Matrices

Big Tensor Data Reduction

Outline : Multimedia Databases and Data Mining. Indexing - similarity search. Indexing - similarity search. C.

arxiv: v1 [cs.lg] 18 Nov 2018

PV211: Introduction to Information Retrieval

Page rank computation HPC course project a.y

Analytic Theory of Power Law Graphs

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

MalSpot: Multi 2 Malicious Network Behavior Patterns Analysis

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data

Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection

Applications to network analysis: Eigenvector centrality indices Lecture notes

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia

Using R for Iterative and Incremental Processing

CS6220: DATA MINING TECHNIQUES

<is web> Web Retrieval. Chapter 3: Authority Ranking. ISWeb Information Systems & Semantic Web University of Koblenz Landau. <Nr.

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

A Note on Google s PageRank

Wafer Pattern Recognition Using Tucker Decomposition

The Canonical Tensor Decomposition and Its Applications to Social Network Analysis

CS6220: DATA MINING TECHNIQUES

Introduction to Information Retrieval

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Three right directions and three wrong directions for tensor research

ZFL, Center of Remote Sensing of Land Surfaces, University of Bonn, Germany. Geographical Institute, University of Bonn, Germany

Google Page Rank Project Linear Algebra Summer 2012

Complex Networks CSYS/MATH 303, Spring, Prof. Peter Dodds

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

SPARSE TENSORS DECOMPOSITION SOFTWARE

GigaTensor: Scaling Tensor Analysis Up By 100 Times - Algorithms and Discoveries

Lecture 7 Mathematics behind Internet Search

ParCube: Sparse Parallelizable Tensor Decompositions

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden

CS47300: Web Information Search and Management

Transcription:

Large Graph Mining: Power Tools and a Practitioner s guide Task 5: Graphs over time & tensors Faloutsos, Miller, Tsourakakis CMU KDD '9 Faloutsos, Miller, Tsourakakis P5-1

Outline Introduction Motivation Task 1: Node importance Task 2: Community detection Task 3: Recommendations Task 4: Connection sub-graphs Task 5: Mining graphs over time Task 6: Virus/influence propagation Task 7: Spectral graph theory Task 8: Tera/peta graph mining: hadoop Observations patterns of real graphs Conclusions KDD '9 Faloutsos, Miller, Tsourakakis P5-2

Tamara Kolda (Sandia) Thanks to for the foils on tensor definitions, and on TOPHITS KDD '9 Faloutsos, Miller, Tsourakakis P5-3

Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining KDD '9 Faloutsos, Miller, Tsourakakis P5-4

Examples of Matrices: Authors and terms John Peter Mary Nick... data mining classif. tree... 13 11 22 55... 5 4 6 7................................................ KDD '9 Faloutsos, Miller, Tsourakakis P5-5

Motivation: Why tensors? Q: what is a tensor? KDD '9 Faloutsos, Miller, Tsourakakis P5-6

Motivation: Why tensors? A: N-D generalization of matrix: KDD 9 John Peter Mary Nick... data mining classif. tree... 13 11 22 55... 5 4 6 7................................................ KDD '9 Faloutsos, Miller, Tsourakakis P5-7

Motivation: Why tensors? A: N-D generalization of matrix: KDD 7 John Peter Mary Nick... KDD 8 KDD 9 data mining classif. tree... 13 11 22 55... 5 4 6 7................................................ KDD '9 Faloutsos, Miller, Tsourakakis P5-8

Tensors are useful for 3 or more modes Terminology: mode (or aspect ): Mode#3 data mining classif. tree... Mode#2 13 11 22 55... 5 4 6 7................................................ Mode (== aspect) #1 KDD '9 Faloutsos, Miller, Tsourakakis P5-9

Notice 3 rd mode does not need to be time we can have more than 3 modes Dest. port 125 8 IP source 13 11 22 55... 5 4 6 7................................................... IP destination KDD '9 Faloutsos, Miller, Tsourakakis P5-1

Notice 3 rd mode does not need to be time we can have more than 3 modes Eg, ffmri: x,y,z, time, person-id, task-id From DENLAB, Temple U. (Prof. V. Megalooikonomou +) http://denlab.temple.edu/bidms/cgi-bin/browse.cgi KDD '9 Faloutsos, Miller, Tsourakakis P5-11

Motivating Applications Why tensors are useful? web mining (TOPHITS) environmental sensors Intrusion detection (src, dst, time, dest-port) Social networks (src, dst, time, type-of-contact) face recognition etc KDD '9 Faloutsos, Miller, Tsourakakis P5-12

Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining KDD '9 Faloutsos, Miller, Tsourakakis P5-13

Tensor basics Multi-mode extensions of SVD recall that: KDD '9 Faloutsos, Miller, Tsourakakis P5-14

Reminder: SVD n n m A m Σ V T U Best rank-k approximation in L2 KDD '9 Faloutsos, Miller, Tsourakakis P5-15

Reminder: SVD n σ 1 u 1 v 1 σ 2 u 2 v 2 m A + Best rank-k approximation in L2 KDD '9 Faloutsos, Miller, Tsourakakis P5-16

Goal: extension to >=3 modes I x J x K I x R K x R C J x R ¼ B = + + A R x R x R KDD '9 Faloutsos, Miller, Tsourakakis P5-17

Main points: 2 major types of tensor decompositions: PARAFAC and Tucker both can be solved with ``alternating least squares (ALS) KDD '9 Faloutsos, Miller, Tsourakakis P5-18

Specially Structured Tensors Tucker Tensor Kruskal Tensor Our Notation Our Notation core W K x T W K x R I x J x K I x R J x S I x J x K w 1 I x R w R J x R = U R x S x T V = = u 1 v 1 U + + V R x R x R u R v R KDD '9 Faloutsos, Miller, Tsourakakis P5-19

Tucker Decomposition - intuition K x T C I x J x K I x R J x S ¼ A R x S x T B author x keyword x conference A: author x author-group B: keyword x keyword-group C: conf. x conf-group G: how groups relate to each other KDD '9 Faloutsos, Miller, Tsourakakis P5-2

Intuition behind core tensor 2-d case: co-clustering [Dhillon et al. Information-Theoretic Coclustering, KDD 3] KDD '9 Faloutsos, Miller, Tsourakakis P5-21

KDD '9 Faloutsos, Miller, Tsourakakis P5-22 CMU SCS.4.4.4.4.4.4.4.4.4.4.5.5.5.5.5.5.5.5.5.5.5.5.36.36.28.28.36.36.36.36.28 28.36.36.54.54.42.54.54.42.42.54.54.42.54.54.5.5.5.5.5.5.2.2.3 3. [ ].36.36.28.28.36.36 = m m n n l k k l eg, terms x documents

med. doc cs doc term group x doc. group.5.5.4.4.5.5.4.4.5.5.4.5.5.4.5.5.4.4.5.5.4.4 med. terms cs terms common terms.5.5.5.5.5.5 term x term-group [ ]. 3.36.36.28.3.28.36.36.2.2 doc x doc group KDD '9 Faloutsos, Miller, Tsourakakis P5-23 =.54.54.36.36.54.54.36.36.42.42 28.28.42.42.28.28.54.54.36.36.54.54.36.36

Two main tools PARAFAC Tucker Tensor tools - summary Both find row-, column-, tube-groups but in PARAFAC the three groups are identical ( To solve: Alternating Least Squares ) KDD '9 Faloutsos, Miller, Tsourakakis P5-24

Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining KDD '9 Faloutsos, Miller, Tsourakakis P5-25

Web graph mining How to order the importance of web pages? Kleinberg s algorithm HITS PageRank Tensor extension on HITS (TOPHITS) KDD '9 Faloutsos, Miller, Tsourakakis P5-26

Kleinberg s Hubs and Authorities (the HITS method) Sparse adjacency matrix and its SVD: authority scores for 1 st topic authority scores for 2 nd topic to from hub scores for 1 st topic hub scores for 2 nd topic KDD '9 Faloutsos, Miller, Tsourakakis P5-27 Kleinberg, JACM, 1999

from CMU SCS authority scores for 1 st topic to HITS Authorities on Sample Data 1st Principal Factor.97 www.ibm.com.24 www.alphaworks.ibm.com 2nd Principal Factor.8 www-128.ibm.com.99 www.lehigh.edu We started our crawl from.5 www.developer.ibm.com.11 www2.lehigh.edu http://www-neos.mcs.anl.gov/neos,.2 www.research.ibm.com 3rd Principal Factor.6 www.lehighalumni.com and crawled 47 pages,.1 www.redbooks.ibm.com.75 java.sun.com.6 www.lehighsports.com resulting in 56.1 news.com.com.38 www.sun.com.2 www.bethlehem-pa.gov cross-linked hosts..36 developers.sun.com.2 www.adobe.com 4th Principal Factor.24 see.sun.com.2 lewisweb.cc.lehigh.edu.6 www.pueblo.gsa.gov.16 www.samag.com.2 www.leo.lehigh.edu.45 www.whitehouse.gov.13 docs.sun.com.2 www.distance.lehigh.edu.35 www.irs.gov.12 blogs.sun.com.2 fp1.cc.lehigh.edu.31 travel.state.gov 6th Principal Factor.8 sunsolve.sun.com.22 www.gsa.gov.97 mathpost.asu.edu.8 www.sun-catalogue.com.2 www.ssa.gov.18 math.la.asu.edu.8 news.com.com.16 www.census.gov.17 www.asu.edu authority scores.14 www.govbenefits.gov.4 www.act.org for 2 nd topic.13 www.kids.gov.3 www.eas.asu.edu.13 www.usdoj.gov.2 archives.math.utk.edu.2 www.geom.uiuc.edu.2 www.fulton.asu.edu.2 www.amstat.org.2 www.maa.org hub scores KDD '9 Faloutsos, Miller, Tsourakakis for 1 st hub scores topic P5-28 for 2 nd topic

Three-Dimensional View of the Web Observe that this tensor is very sparse! KDD '9 Faloutsos, Miller, Tsourakakis P5-29 Kolda, Bader, Kenny, ICDM5

Three-Dimensional View of the Web Observe that this tensor is very sparse! KDD '9 Faloutsos, Miller, Tsourakakis P5-3 Kolda, Bader, Kenny, ICDM5

Three-Dimensional View of the Web Observe that this tensor is very sparse! KDD '9 Faloutsos, Miller, Tsourakakis P5-31 Kolda, Bader, Kenny, ICDM5

Topical HITS (TOPHITS) Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information. term term scores for 1 st topic term scores for 2 nd topic to from hub scores for 1 st topic authority scores for 1 st topic authority scores for 2 nd topic hub scores for 2 nd topic KDD '9 Faloutsos, Miller, Tsourakakis P5-32

Topical HITS (TOPHITS) Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information. term term scores for 1 st topic term scores for 2 nd topic to from hub scores for 1 st topic authority scores for 1 st topic authority scores for 2 nd topic hub scores for 2 nd topic KDD '9 Faloutsos, Miller, Tsourakakis P5-33

TOPHITS Terms & Authorities on Sample Data 1st Principal Factor.23 JAVA.86 java.sun.com.18 SUN.38 developers.sun.com 2nd Principal Factor.17 PLATFORM.16 docs.sun.com.2 NO-READABLE-TEXT.99 www.lehigh.edu TOPHITS uses 3D analysis to find.16 SOLARIS.14 see.sun.com.16 FACULTY.6.16 DEVELOPER.14 www.sun.com 3rd www2.lehigh.edu Principal Factor the dominant groupings of web.16 SEARCH.15 EDITION.15 NO-READABLE-TEXT.3 www.lehighalumni.com.9 www.samag.com.97 www.ibm.com pages and terms..16 NEWS.15 DOWNLOAD.15 IBM.7 developer.sun.com.18 www.alphaworks.ibm.com.16 LIBRARIES.14 INFO.12 SERVICES.6 sunsolve.sun.com.7 4th www-128.ibm.com Principal Factor.16 COMPUTING.12 SOFTWARE.12 WEBSPHERE.5 access1.sun.com.26 INFORMATION.5 www.developer.ibm.com.87 www.pueblo.gsa.gov.12 LEHIGH.12 NO-READABLE-TEXT.12 WEB.5 iforce.sun.com.24 FEDERAL.2 www.redbooks.ibm.com.24 www.irs.gov.11 DEVELOPERWORKS.23 CITIZEN.1 www.research.ibm.com.23 6th www.whitehouse.gov Principal Factor w.11 LINUX.22 OTHER.19 travel.state.gov k = # unique links using term k.26 PRESIDENT.87 www.whitehouse.gov.11 RESOURCES.19 CENTER.25 NO-READABLE-TEXT.18 www.gsa.gov.18 www.irs.gov.11 TECHNOLOGIES.19 LANGUAGES.25 BUSH.9 www.consumer.gov.16 travel.state.gov.1 DOWNLOADS.15 U.S.25 WELCOME.9 www.kids.gov 12th Principal Factor.1 www.gsa.gov.15 PUBLICATIONS.75 OPTIMIZATION.17 WHITE.7 www.ssa.gov.35 www.palisade.com.8 www.ssa.gov.14 CONSUMER.58 SOFTWARE.16 U.S.5 www.forms.gov.35 www.solver.com.5 www.govbenefits.gov.13 FREE.8 DECISION.4 www.govbenefits.gov.33 13th plato.la.asu.edu Principal Factor.15 HOUSE.7 NEOS.46 ADOBE.4 www.census.gov.29 www.mat.univie.ac.at.99 www.adobe.com.13 BUDGET.6 TREE.45 READER.4 www.usdoj.gov.28 www.ilog.com.13 PRESIDENTS.5 GUIDE.45 ACROBAT.4 www.kids.gov.26 www.dashoptimization.com 16th Principal Factor.11 OFFICE.2 www.forms.gov.5 SEARCH.3 FREE.5 WEATHER.26 www.grabitech.com.81 www.weather.gov Tensor PARAFAC.5 ENGINE.3 NO-READABLE-TEXT.24 OFFICE.25 www-fp.mcs.anl.gov.41 www.spc.noaa.gov.5 CONTROL.29 HERE.23 CENTER.22 www.spyderopts.com.3 lwf.ncdc.noaa.gov.5 ILOG.29 COPY.19 NO-READABLE-TEXT.17 www.mosek.com.15 19th www.cpc.ncep.noaa.gov Principal Factor term scores term scores for 1 st topic for 2.5 DOWNLOAD.17 ORGANIZATION.22 TAX.14 www.nhc.noaa.gov.73 www.irs.gov nd topic.15 NWS.17 TAXES.9 www.prh.noaa.gov.43 travel.state.gov to.15 SEVERE.15 CHILD.7 aviationweather.gov.22 www.ssa.gov.15 FIRE.15 RETIREMENT.6 www.nohrsc.nws.gov.8 www.govbenefits.gov authority scores authority scores.15 POLICY.14 BENEFITS.6 www.srh.noaa.gov.6 www.usdoj.gov for 1 st topic for 2 nd topic.14 CLIMATE.14 STATE.3 www.census.gov hub scores.14 INCOME.3 www.usmint.gov for 2 KDD '9 hub scores nd topic Faloutsos, Miller, Tsourakakis.13 SERVICE.2 www.nws.noaa.gov for 1 P5-34 st topic.13 REVENUE.2 www.gsa.gov.12 CREDIT.1 www.annualcreditreport.com from term

Conclusions Real data are often in high dimensions with multiple aspects (modes) Tensors provide elegant theory and algorithms PARAFAC and Tucker: discover groups KDD '9 Faloutsos, Miller, Tsourakakis P5-35

References T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 25, Pages 242-249, November 25. Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on Highdimensional and Multi-aspect Streams, Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 26 KDD '9 Faloutsos, Miller, Tsourakakis P5-36

Resources See tutorial on tensors, KDD 7 (w/ Tamara Kolda and Jimeng Sun): www.cs.cmu.edu/~christos/talks/kdd-7-tutorial KDD '9 Faloutsos, Miller, Tsourakakis P5-37

Tensor tools - resources Toolbox: from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/tensortoolbox T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 29 csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/tensorreview-preprint.pdf T. Kolda KDD ICDE 9 '9and J. Sun: Scalable Copyright: Faloutsos, Tensor Faloutsos, Miller, Tsourakakis Tong Decomposition (29) for Multi-Aspect P5-38 2-38 Data Mining (ICDM 28)