Large Graph Mining: Power Tools and a Practitioner's Guide
Task 5: Graphs over time & tensors
Faloutsos, Miller, Tsourakakis (CMU)
KDD '09
Outline: Introduction - Motivation; Task 1: Node importance; Task 2: Community detection; Task 3: Recommendations; Task 4: Connection sub-graphs; Task 5: Mining graphs over time; Task 6: Virus/influence propagation; Task 7: Spectral graph theory; Task 8: Tera/peta graph mining: hadoop; Observations - patterns of real graphs; Conclusions
Thanks to Tamara Kolda (Sandia) for the foils on tensor definitions, and on TOPHITS.
Detailed outline: Motivation; Definitions: PARAFAC and Tucker; Case study: web mining
Examples of Matrices: Authors and terms

         data  mining  classif.  tree  ...
John       13      11        22    55  ...
Peter       5       4         6     7  ...
Mary      ...     ...       ...   ...  ...
Nick      ...     ...       ...   ...  ...
...
Motivation: Why tensors? Q: What is a tensor?
Motivation: Why tensors? A: N-D generalization of a matrix. [Figure: the authors x terms matrix above (John, Peter, Mary, Nick, ... x data, mining, classif., tree, ...), labeled as the KDD '09 slice]
Motivation: Why tensors? A: N-D generalization of a matrix. [Figure: the authors x terms matrices for KDD '07, KDD '08 and KDD '09 stacked into a 3-mode tensor]
Tensors are useful for 3 or more modes. Terminology: "mode" (or "aspect"). [Figure: the authors x terms x conferences tensor with its three axes labeled Mode (== aspect) #1, Mode #2 and Mode #3]
Notice: the 3rd mode does not need to be time; we can have more than 3 modes. [Figure: a 3-mode tensor with axes IP source x IP destination x dest. port (e.g., ports 125 and 80)]
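The IP source x IP destination x destination port example can be stored as a 3-mode array. The following is a minimal sketch, assuming NumPy and a few made-up flow records; the addresses, ports and counts are hypothetical and do not come from the slides, and a dense array is used only because the toy data is tiny.

```python
import numpy as np

# Hypothetical toy records: (IP source, IP destination, destination port, count).
records = [
    ("10.0.0.1", "10.0.0.9", 80, 13),
    ("10.0.0.1", "10.0.0.8", 80, 11),
    ("10.0.0.2", "10.0.0.9", 125, 5),
    ("10.0.0.2", "10.0.0.9", 80, 4),
]

def index_of(values):
    """Map each distinct value to a position along one mode."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

src_idx = index_of(r[0] for r in records)    # mode 1: IP source
dst_idx = index_of(r[1] for r in records)    # mode 2: IP destination
port_idx = index_of(r[2] for r in records)   # mode 3: destination port

# Dense 3-mode tensor; real traffic data would call for a sparse format.
X = np.zeros((len(src_idx), len(dst_idx), len(port_idx)))
for src, dst, port, count in records:
    X[src_idx[src], dst_idx[dst], port_idx[port]] += count

print(X.shape)
```

Adding a fourth mode (for instance a time bucket) only requires one more index map and one more array axis.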
Notice: the 3rd mode does not need to be time; we can have more than 3 modes. E.g., fMRI: x, y, z, time, person-id, task-id. From DENLAB, Temple U. (Prof. V. Megalooikonomou +) http://denlab.temple.edu/bidms/cgi-bin/browse.cgi
Motivating applications. Why are tensors useful? Web mining (TOPHITS); environmental sensors; intrusion detection (src, dst, time, dest-port); social networks (src, dst, time, type-of-contact); face recognition; etc.
Detailed outline: Motivation; Definitions: PARAFAC and Tucker; Case study: web mining
Tensor basics: multi-mode extensions of SVD. Recall that:
Reminder: SVD. An m x n matrix A factors as $A = U \Sigma V^T$. Best rank-k approximation in L2.
Reminder: SVD. $A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$; keeping the first k terms gives the best rank-k approximation in L2.
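The rank-k truncation in the reminder can be checked numerically. This is a small sketch, assuming NumPy and a random test matrix (the matrix and the helper name `best_rank_k` are illustrative, not from the slides): it keeps the first k terms $\sigma_i u_i v_i^T$ and compares the spectral-norm (L2) error with the first discarded singular value.

```python
import numpy as np

def best_rank_k(A, k):
    """Sum of the first k terms sigma_i * u_i * v_i^T of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.random((6, 4))           # arbitrary m x n test matrix (m=6, n=4)
k = 2
A2 = best_rank_k(A, k)

s = np.linalg.svd(A, compute_uv=False)
l2_error = np.linalg.norm(A - A2, 2)   # spectral norm of the residual

# The L2 error of the truncation equals the (k+1)-th singular value.
print(l2_error, s[k])
```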
Goal: extension to >=3 modes. $\mathcal{X} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r$, where $\mathcal{X}$ is I x J x K, A = [a_1 ... a_R] is I x R, B is J x R, C is K x R, and the core is R x R x R.
Main points: there are 2 major types of tensor decompositions, PARAFAC and Tucker; both can be solved with "alternating least squares" (ALS).
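Specially Structured Tensors
Tucker tensor (our notation): $\mathcal{X} = \sum_{r=1}^{R} \sum_{s=1}^{S} \sum_{t=1}^{T} g_{rst}\, u_r \circ v_s \circ w_t$, with $\mathcal{X}$ I x J x K, core R x S x T, U I x R, V J x S, W K x T.
Kruskal tensor (our notation): $\mathcal{X} = u_1 \circ v_1 \circ w_1 + \dots + u_R \circ v_R \circ w_R$, with $\mathcal{X}$ I x J x K, core R x R x R, U I x R, V J x R, W K x R.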
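The two structured forms can be written as one `einsum` each. The sketch below assumes NumPy and random factor matrices (the function names are illustrative); it also checks that a Kruskal tensor is the special case of a Tucker tensor whose R x R x R core has ones on the superdiagonal and zeros elsewhere.

```python
import numpy as np

def tucker_to_tensor(G, U, V, W):
    """X[i,j,k] = sum_{r,s,t} G[r,s,t] * U[i,r] * V[j,s] * W[k,t]."""
    return np.einsum("rst,ir,js,kt->ijk", G, U, V, W)

def kruskal_to_tensor(U, V, W):
    """X[i,j,k] = sum_r U[i,r] * V[j,r] * W[k,r]."""
    return np.einsum("ir,jr,kr->ijk", U, V, W)

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 2, 2
U, V, W = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

X_kruskal = kruskal_to_tensor(U, V, W)

# Superdiagonal core: G[r, r, r] = 1, all other entries 0.
G = np.zeros((R, R, R))
G[np.arange(R), np.arange(R), np.arange(R)] = 1.0
X_tucker = tucker_to_tensor(G, U, V, W)

print(np.allclose(X_kruskal, X_tucker))
```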
Tucker Decomposition - intuition. $\mathcal{X}$ (I x J x K) ≈ core G (R x S x T) combined with factors A (I x R), B (J x S), C (K x T). Example: author x keyword x conference. A: author x author-group; B: keyword x keyword-group; C: conf. x conf-group; G: how groups relate to each other.
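One simple way to obtain Tucker factors is the truncated higher-order SVD (HOSVD). Note that this is a swapped-in technique for illustration, not the alternating-least-squares solver the slides mention. The sketch assumes NumPy; the helper names and the synthetic tensor are hypothetical. Each factor holds the leading left singular vectors of one unfolding, and the core G records how the groups of the three modes relate.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that the chosen mode indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def hosvd(X, ranks):
    """Truncated HOSVD of a 3-mode tensor: returns core G and factors A, B, C."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])           # leading r left singular vectors
    A, B, C = factors
    # Core = X multiplied by the transposed factor in every mode.
    G = np.einsum("ijk,ir,js,kt->rst", X, A, B, C)
    return G, A, B, C

# Synthetic tensor with exact Tucker structure: 2 groups per mode.
rng = np.random.default_rng(1)
G0 = rng.standard_normal((2, 2, 2))
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (5, 4, 3))
X = np.einsum("rst,ir,js,kt->ijk", G0, A0, B0, C0)

G, A, B, C = hosvd(X, ranks=(2, 2, 2))
X_hat = np.einsum("rst,ir,js,kt->ijk", G, A, B, C)
print(np.allclose(X, X_hat))
```

On real data the ranks (R, S, T) are a modeling choice, and the truncated HOSVD is usually treated as a starting point that an ALS-style refinement can improve.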
Intuition behind core tensor. 2-d case: co-clustering [Dhillon et al., Information-Theoretic Co-clustering, KDD '03]
[Figure: co-clustering example, e.g. terms x documents: an m x n matrix written as the product of an m x k matrix, a k x l matrix and an l x n matrix]
[Figure: the same example with the three factors labeled "term x term-group" (med. terms, cs terms, common terms), "term group x doc. group" and "doc x doc group" (med. doc, cs doc)]
Tensor tools - summary. Two main tools: PARAFAC and Tucker. Both find row-, column- and tube-groups, but in PARAFAC the three groups are identical: row-group r is paired with column-group r and tube-group r, so all three modes use the same number of groups. (To solve: Alternating Least Squares.)
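Alternating Least Squares for PARAFAC fixes two factor matrices and solves an ordinary least-squares problem for the third, cycling over the modes. The following is a bare-bones sketch, assuming NumPy, a dense 3-mode tensor, random initialization and a fixed iteration count; the function names are illustrative, and production code (such as the Tensor Toolbox) adds normalization, convergence tests and sparse support.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that the chosen mode indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(P, Q):
    """Column-wise Kronecker product; row (p, q) has q varying fastest."""
    return np.einsum("pr,qr->pqr", P, Q).reshape(-1, P.shape[1])

def reconstruct(factors):
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def parafac_als(X, rank, n_iter=200, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            P, Q = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(P, Q)
            gram = (P.T @ P) * (Q.T @ Q)   # equals kr.T @ kr
            # Least-squares update of one factor with the other two fixed.
            factors[mode] = unfold(X, mode) @ kr @ np.linalg.pinv(gram)
    return factors

# Synthetic tensor with exactly 2 groups (hypothetical test data).
rng = np.random.default_rng(0)
true_factors = [rng.standard_normal((n, 2)) for n in (5, 4, 3)]
X = reconstruct(true_factors)

factors = parafac_als(X, rank=2)
rel_err = np.linalg.norm(X - reconstruct(factors)) / np.linalg.norm(X)
print(rel_err)
```

Each update can only keep or lower the fitting error, but ALS may converge slowly or stop in a local minimum, so several random restarts are a common precaution.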
Detailed outline: Motivation; Definitions: PARAFAC and Tucker; Case study: web mining
Web graph mining. How to order the importance of web pages? Kleinberg's algorithm HITS; PageRank; tensor extension of HITS (TOPHITS)
Kleinberg's Hubs and Authorities (the HITS method) [Kleinberg, JACM, 1999]. Sparse adjacency matrix and its SVD. [Figure: the from x to adjacency matrix ≈ (hub scores for 1st topic) x (authority scores for 1st topic) + (hub scores for 2nd topic) x (authority scores for 2nd topic) + ...]
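The link between HITS and the SVD can be seen on a toy graph. This sketch assumes NumPy and a made-up 4-page adjacency matrix (not data from the slides): the first left and right singular vectors give hub and authority scores for the dominant topic, and the usual HITS power iteration converges to the same vectors.

```python
import numpy as np

# Hypothetical toy web graph: A[i, j] = 1 if page i links to page j.
A = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

# SVD view: first singular-vector pair = scores for the 1st topic.
U, s, Vt = np.linalg.svd(A)
hubs_svd = np.abs(U[:, 0])     # sign of a singular vector is arbitrary
auth_svd = np.abs(Vt[0])

def hits_power(A, n_iter=50):
    """Classic HITS iteration: a <- A^T h, h <- A a, with normalization."""
    h = np.ones(A.shape[0])
    for _ in range(n_iter):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return h, a

hubs_pow, auth_pow = hits_power(A)
print(np.allclose(hubs_svd, hubs_pow), np.allclose(auth_svd, auth_pow))
```

Later singular-vector pairs play the role of the scores for the 2nd, 3rd, ... topics in the factor listings.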
HITS Authorities on Sample Data
We started our crawl from http://www-neos.mcs.anl.gov/neos, and crawled 47 pages, resulting in 56 cross-linked hosts.
1st Principal Factor: .97 www.ibm.com; .24 www.alphaworks.ibm.com; .08 www-128.ibm.com; .05 www.developer.ibm.com; .02 www.research.ibm.com; .01 www.redbooks.ibm.com; .01 news.com.com
2nd Principal Factor: .99 www.lehigh.edu; .11 www2.lehigh.edu; .06 www.lehighalumni.com; .06 www.lehighsports.com; .02 www.bethlehem-pa.gov; .02 www.adobe.com; .02 lewisweb.cc.lehigh.edu; .02 www.leo.lehigh.edu; .02 www.distance.lehigh.edu; .02 fp1.cc.lehigh.edu
3rd Principal Factor: .75 java.sun.com; .38 www.sun.com; .36 developers.sun.com; .24 see.sun.com; .16 www.samag.com; .13 docs.sun.com; .12 blogs.sun.com; .08 sunsolve.sun.com; .08 www.sun-catalogue.com; .08 news.com.com
4th Principal Factor: .60 www.pueblo.gsa.gov; .45 www.whitehouse.gov; .35 www.irs.gov; .31 travel.state.gov; .22 www.gsa.gov; .20 www.ssa.gov; .16 www.census.gov; .14 www.govbenefits.gov; .13 www.kids.gov; .13 www.usdoj.gov
6th Principal Factor: .97 mathpost.asu.edu; .18 math.la.asu.edu; .17 www.asu.edu; .04 www.act.org; .03 www.eas.asu.edu; .02 archives.math.utk.edu; .02 www.geom.uiuc.edu; .02 www.fulton.asu.edu; .02 www.amstat.org; .02 www.maa.org
Three-Dimensional View of the Web [Kolda, Bader, Kenny, ICDM '05]. Observe that this tensor is very sparse!
Topical HITS (TOPHITS). Main idea: extend the idea behind the HITS model to incorporate term (i.e., topical) information. [Figure: the from x to x term tensor ≈ (hub scores) x (authority scores) x (term scores) for the 1st topic + (hub scores) x (authority scores) x (term scores) for the 2nd topic + ...]
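The dominant TOPHITS factor can be illustrated with a rank-1 higher-order power iteration, the three-mode analogue of the HITS iteration. This is a deliberately simplified sketch, assuming NumPy and a tiny made-up from x to x term tensor; the full method fits a multi-factor PARAFAC model rather than a single factor, and the function name is hypothetical.

```python
import numpy as np

def tophits_rank1(X, n_iter=100):
    """Dominant (hub, authority, term) triple of a from x to x term tensor."""
    I, J, K = X.shape
    h, a, t = np.ones(I), np.ones(J), np.ones(K)
    for _ in range(n_iter):
        h = np.einsum("ijk,j,k->i", X, a, t)   # hubs from authorities and terms
        h /= np.linalg.norm(h)
        a = np.einsum("ijk,i,k->j", X, h, t)   # authorities from hubs and terms
        a /= np.linalg.norm(a)
        t = np.einsum("ijk,i,j->k", X, h, a)   # terms from hubs and authorities
        t /= np.linalg.norm(t)
    weight = np.einsum("ijk,i,j,k->", X, h, a, t)
    return weight, h, a, t

# Hypothetical data: X[i, j, k] = 1 if page i links to page j using term k.
# Term 0 is used on three links among pages 0-2, term 1 on the single link 3 -> 4.
X = np.zeros((5, 5, 2))
X[0, 1, 0] = X[0, 2, 0] = X[1, 2, 0] = 1.0
X[3, 4, 1] = 1.0

weight, h, a, t = tophits_rank1(X)
print(np.argmax(h), np.argmax(a), np.argmax(t))
```

The recovered triple groups the pages and the term that belong to the larger link cluster, which is the kind of grouping the factor listings report for real crawl data.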
TOPHITS Terms & Authorities on Sample Data (Tensor PARAFAC)
TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms. $w_k$ = # unique links using term k.
1st Principal Factor
  Terms: .23 JAVA; .18 SUN; .17 PLATFORM; .16 SOLARIS; .16 DEVELOPER; .15 EDITION; .15 DOWNLOAD; .14 INFO; .12 SOFTWARE; .12 NO-READABLE-TEXT
  Authorities: .86 java.sun.com; .38 developers.sun.com; .16 docs.sun.com; .14 see.sun.com; .14 www.sun.com; .09 www.samag.com; .07 developer.sun.com; .06 sunsolve.sun.com; .05 access1.sun.com; .05 iforce.sun.com
2nd Principal Factor
  Terms: .20 NO-READABLE-TEXT; .16 FACULTY; .16 SEARCH; .16 NEWS; .16 LIBRARIES; .16 COMPUTING; .12 LEHIGH
  Authorities: .99 www.lehigh.edu; .06 www2.lehigh.edu; .03 www.lehighalumni.com
3rd Principal Factor
  Terms: .15 NO-READABLE-TEXT; .15 IBM; .12 SERVICES; .12 WEBSPHERE; .12 WEB; .11 DEVELOPERWORKS; .11 LINUX; .11 RESOURCES; .11 TECHNOLOGIES; .10 DOWNLOADS
  Authorities: .97 www.ibm.com; .18 www.alphaworks.ibm.com; .07 www-128.ibm.com; .05 www.developer.ibm.com; .02 www.redbooks.ibm.com; .01 www.research.ibm.com
4th Principal Factor
  Terms: .26 INFORMATION; .24 FEDERAL; .23 CITIZEN; .22 OTHER; .19 CENTER; .19 LANGUAGES; .15 U.S; .15 PUBLICATIONS; .14 CONSUMER; .13 FREE
  Authorities: .87 www.pueblo.gsa.gov; .24 www.irs.gov; .23 www.whitehouse.gov; .19 travel.state.gov; .18 www.gsa.gov; .09 www.consumer.gov; .09 www.kids.gov; .07 www.ssa.gov; .05 www.forms.gov; .04 www.govbenefits.gov
6th Principal Factor
  Terms: .26 PRESIDENT; .25 NO-READABLE-TEXT; .25 BUSH; .25 WELCOME; .17 WHITE; .16 U.S; .15 HOUSE; .13 BUDGET; .13 PRESIDENTS; .11 OFFICE
  Authorities: .87 www.whitehouse.gov; .18 www.irs.gov; .16 travel.state.gov; .10 www.gsa.gov; .08 www.ssa.gov; .05 www.govbenefits.gov; .04 www.census.gov; .04 www.usdoj.gov; .04 www.kids.gov; .02 www.forms.gov
12th Principal Factor
  Terms: .75 OPTIMIZATION; .58 SOFTWARE; .08 DECISION; .07 NEOS; .06 TREE; .05 GUIDE; .05 SEARCH; .05 ENGINE; .05 CONTROL; .05 ILOG
  Authorities: .35 www.palisade.com; .35 www.solver.com; .33 plato.la.asu.edu; .29 www.mat.univie.ac.at; .28 www.ilog.com; .26 www.dashoptimization.com; .26 www.grabitech.com; .25 www-fp.mcs.anl.gov; .22 www.spyderopts.com; .17 www.mosek.com
13th Principal Factor
  Terms: .46 ADOBE; .45 READER; .45 ACROBAT; .30 FREE; .30 NO-READABLE-TEXT; .29 HERE; .29 COPY; .05 DOWNLOAD
  Authorities: .99 www.adobe.com
16th Principal Factor
  Terms: .50 WEATHER; .24 OFFICE; .23 CENTER; .19 NO-READABLE-TEXT; .17 ORGANIZATION; .15 NWS; .15 SEVERE; .15 FIRE; .15 POLICY; .14 CLIMATE
  Authorities: .81 www.weather.gov; .41 www.spc.noaa.gov; .30 lwf.ncdc.noaa.gov; .15 www.cpc.ncep.noaa.gov; .14 www.nhc.noaa.gov; .09 www.prh.noaa.gov; .07 aviationweather.gov; .06 www.nohrsc.nws.gov; .06 www.srh.noaa.gov
19th Principal Factor
  Terms: .22 TAX; .17 TAXES; .15 CHILD; .15 RETIREMENT; .14 BENEFITS; .14 STATE; .14 INCOME; .13 SERVICE; .13 REVENUE; .12 CREDIT
  Authorities: .73 www.irs.gov; .43 travel.state.gov; .22 www.ssa.gov; .08 www.govbenefits.gov; .06 www.usdoj.gov; .03 www.census.gov; .03 www.usmint.gov; .02 www.nws.noaa.gov; .02 www.gsa.gov; .01 www.annualcreditreport.com
Conclusions: Real data are often in high dimensions with multiple aspects (modes). Tensors provide elegant theory and algorithms. PARAFAC and Tucker: discover groups.
References
T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, pages 242-249, November 2005.
Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams. Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec. 2006.
Resources: See tutorial on tensors, KDD '07 (w/ Tamara Kolda and Jimeng Sun): www.cs.cmu.edu/~christos/talks/kdd-7-tutorial
Tensor tools - resources
Toolbox: from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/tensortoolbox
T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009. csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/tensorreview-preprint.pdf
T. Kolda and J. Sun: Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008)