
Mining Large Time-evolving Data Using Matrix and Tensor Tools
SDM'07 Tutorial
Christos Faloutsos (Carnegie Mellon Univ.), Tamara G. Kolda (Sandia National Labs), Jimeng Sun (Carnegie Mellon Univ.)

About the tutorial
- Introduce matrix and tensor tools through real mining applications
- Goal: find patterns, rules, clusters, and outliers, in matrices and in tensors

Motivation 1: Why matrices? Why are matrices important?
Examples of matrices:
- Graph / social network: people x people adjacency (John, Peter, Mary, Nick, ...)
- Cloud of n-d points: people x numeric attributes (chol#, blood#, age)
- Market basket, as in association rules: people x products (milk, bread, choc., wine)

More examples of matrices:
- Documents and terms: papers (Paper#1, Paper#2, Paper#3, Paper#4, ...) x terms (data, mining, classif., tree)
- Authors and terms: authors (John, Peter, Mary, Nick, ...) x the same terms
- Sensor-ids and time-ticks: sensors (temp1, temp2, humid., pressure) x time-ticks (t1, t2, t3, t4, ...)

Motivation 2: Why tensor?
Q: What is a tensor?
A: An N-D generalization of a matrix: the author x keyword matrix for SDM'07 is a matrix; stacking the author x keyword matrices for SDM'05, SDM'06, and SDM'07 gives a 3rd-order tensor.

Terminology: "mode" (or "aspect"): e.g., mode #1 = authors, mode #2 = keywords (data, mining, classif., tree), mode #3 = conference year. Tensors are useful for 3 or more modes.

Motivating applications (why matrices are important, why tensors are useful):
- P1: environmental sensors
- P2: data center monitoring ("autonomic")
- P3: social networks
- P4: network forensics
- P5: web mining

P1: Environmental sensor monitoring
- Data in three aspects (time, location, type)
- [Figure: value-vs-time plots for the four types: temperature, light, humidity, voltage]

P2: Clusters/data center monitoring
- Monitor correlations of multiple measurements
- Automatically flag anomalous behavior
- InteMon: intelligent monitoring system, with Prof. Greg Ganger and PDL; >100 machines in a data center; warsteiner.db.cs.cmu.edu/demo/intemon.jsp

P3: Social network analysis
- Traditionally, people focus on static networks and find community structures
- We plan to monitor the change of the community structure over time (e.g., authors x DB/DM keywords over the years 1990-2004)

P4: Network forensics
- Directional network flows
- A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets 2004]; 450 GB/hour with compression
- Task: identify abnormal traffic patterns and find the cause
- [Figure: source x destination traffic matrices for abnormal vs. normal traffic]
- Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie

P5: Web graph mining
- How to order the importance of web pages?
- Kleinberg's algorithm (HITS); PageRank; tensor extension of HITS (TOPHITS); context-sensitive hypergraph analysis

Static data model: tensor
- Formally, a generalization of matrices
- Represented as a multi-array (~ data cube)

  Order | Correspondence | Example
  1st   | vector         | sensors
  2nd   | matrix         | authors x keywords
  3rd   | 3D array       | sources x destinations x ports

Dynamic data model: tensor streams
- A sequence of Mth-order tensors, where t increases over time

  Order | Correspondence       | Example
  1st   | multiple streams     | sensors x time
  2nd   | time-evolving graphs | (author x keyword) over time
  3rd   | 3D arrays            | (source x destination x port) over time

Roadmap: Motivation -> Matrix tools -> Tensor basics -> Tensor extensions -> Software demo -> Case studies

Roadmap: Matrix tools
- SVD, PCA
- HITS, PageRank
- CUR
- Co-clustering
- Nonnegative matrix factorization
General goals: patterns, anomaly detection, compression

Example/intuition: documents and terms -- find patterns, groups, concepts in the papers (Paper#1, Paper#2, ...) x terms (data, mining, classif., tree) matrix.

Singular Value Decomposition (SVD)
- X = [x(1) x(2) ... x(M)] is the input data
- X ~ U Sigma V^T, where U = [u1 u2 ... uk] holds the left singular vectors, Sigma = diag(sigma1, ..., sigmak) holds the singular values, and V^T = [v1 v2 ... vk]^T holds the right singular vectors

SVD as spectral decomposition
- A ~ sigma1 u1 v1^T + sigma2 u2 v2^T + ...
- Best rank-k approximation in L2 and Frobenius norm
- SVD only works for static matrices (a single 2nd-order tensor); see also PARAFAC

SVD example: A = U Sigma V^T on a doc-term matrix whose terms are "data" and "inf. retrieval" (CS papers) and "brain" and "lung" (MD papers). [Figure: the numeric matrices of the worked example]
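To make the rank-k approximation concrete, here is a minimal NumPy sketch (not from the tutorial; the matrix size and k are illustrative assumptions):

```python
import numpy as np

A = np.random.rand(100, 30)                  # e.g., documents x terms (assumed sizes)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5                                        # number of concepts to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approx in L2/Frobenius

print(np.linalg.norm(A - A_k, 'fro'))        # reconstruction error
```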

SVD example (cont.): in A = U Sigma V^T,
- U is the doc-to-concept similarity matrix (here two concepts emerge: a CS-concept and an MD-concept)
- the diagonal of Sigma gives the strength of each concept (e.g., the strength of the CS-concept)
- V^T is the term-to-concept similarity matrix

SVD properties
- V holds the eigenvectors of the covariance matrix X^T X
- U holds the eigenvectors of the Gram (inner-product) matrix X X^T
Further reading:
1. Ian T. Jolliffe, Principal Component Analysis (2nd ed.), Springer, 2002.
2. Gilbert Strang, Linear Algebra and Its Applications (4th ed.), Brooks Cole.

SVD interpretation: documents, terms, and concepts
- Q: if A is the document-to-term matrix, what is A^T A?
- A: the term-to-term ([m x m]) similarity matrix
- Q: and A A^T?
- A: the document-to-document ([n x n]) similarity matrix

Principal Component Analysis (PCA)
- PCA is an important application of SVD: for an m x n matrix A, the SVD A ~ U Sigma V^T gives the principal components and the loadings
- Note that U and V are dense and may have negative entries

PCA interpretation
- Best axis to project on ("best" = minimizes the sum of squares of projection errors)
- PCA projects points onto the "best" axis, the first singular vector v1, with minimum RMS error [Figure: points in the (Term1 = "data", Term2 = "lung") plane and the fitted axis v1]

Kleinberg's algorithm (HITS)
- Problem definition: given the web and a query, find the most authoritative web pages for this query
- Step 0: find all pages containing the query terms
- Step 1: expand by one move forward and backward
Further reading: J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA 1998.

Kleinberg's algorithm (HITS), cont.
- Step 1: expand by one move forward and backward
- On the resulting graph, give a high score ("authorities") to nodes that many important nodes point to, and a high importance score ("hubs") to nodes that point to good authorities

Observations: recursive definition! Each node (say, the i-th node) has both an authoritativeness score a_i and a hubness score h_i.

- Let A be the adjacency matrix: the (i,j) entry is 1 if the edge from i to j exists
- Let h and a be [n x 1] vectors with the hubness and authoritativeness scores. Then:
  a_i ~ h_k + h_l + h_m, that is, a_i = sum of h_j over all j such that edge (j,i) exists, or a = A^T h
  and symmetrically for the hubness: h_i ~ a_n + a_p + a_q, that is, h_i = sum of a_j over all j such that edge (i,j) exists, or h = A a
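A minimal sketch of this iteration (a = A^T h, h = A a) in NumPy; the toy adjacency matrix and the normalization choice are assumptions, not from the slides:

```python
import numpy as np

# A[i, j] = 1 iff the edge from i to j exists (toy 3-node example).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

a = np.ones(3)                   # authoritativeness scores a_i
h = np.ones(3)                   # hubness scores h_i
for _ in range(100):
    a = A.T @ h                  # a_i = sum of h_j over in-links (j, i)
    h = A @ a                    # h_i = sum of a_j over out-links (i, j)
    a /= np.linalg.norm(a)       # rescale so the scores stay bounded
    h /= np.linalg.norm(h)
# a converges to the top right singular vector of A (eigenvector of A^T A),
# as argued on the next slide.
```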

Kleinberg's algorithm (HITS): in conclusion, we want vectors h and a such that h = A a and a = A^T h; that is, a = A^T A a.

- So a is a right singular vector of the adjacency matrix A (by definition!), a.k.a. an eigenvector of A^T A
- Starting from a random a and iterating, we'll eventually converge
- Q: to which of all the eigenvectors? why?
- A: to the one of the strongest eigenvalue, because (A^T A)^k v ~ (constant) v1

Discussion
- The authority score can be used to find similar pages (how?)
- Closely related to citation analysis and social networks / small-world phenomena
- See also TOPHITS

Motivating problem: PageRank
- Given a directed graph, find its most interesting/central node
- A node is important if it is connected with important nodes (recursive, but OK!)
- Proposed solution: random walk; spot the most popular node (-> steady-state probability, ssp)
- A node has high ssp if it is connected with high-ssp nodes (recursive, but OK!)

(Simplified) PageRank algorithm
- Let A be the transition matrix (= adjacency matrix); let A^T be column-normalized; then A^T p = p
- [Figure: a 5-node example (p1, ..., p5) whose column-normalized A^T has entries 1 and 1/2]
- A^T p = 1 * p: thus p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is column-normalized)
- Why does such a p exist? p exists if A is n x n, nonnegative, and irreducible [Perron-Frobenius theorem]
- In short: imagine a particle randomly moving along the edges; compute its steady-state probabilities (ssp)
- Full version of the algorithm: with occasional random jumps. Why? To make the matrix irreducible.

Full algorithm
- With probability 1-c, fly out to a random node
- Then we have p = c A p + (1-c)/n * 1, so p = (1-c)/n * [I - c A]^(-1) * 1
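A minimal NumPy sketch of the full algorithm, checking that the closed form p = (1-c)/n [I - cA]^(-1) 1 agrees with the random-jump iteration; the 5-node graph and c = 0.85 are illustrative assumptions:

```python
import numpy as np

n, c = 5, 0.85
# Rows below are each node's outgoing distribution, so the transpose is
# the column-normalized transition matrix A used on the slide.
A = np.array([[0, 1, 0,   0,   0],
              [0, 0, 1,   0,   0],
              [0, 0, 0, 0.5, 0.5],
              [1, 0, 0,   0,   0],
              [0, 0, 0,   1,   0]]).T

# Closed form from the slide: p = (1-c)/n * [I - c A]^{-1} * 1
p = (1 - c) / n * np.linalg.solve(np.eye(n) - c * A, np.ones(n))

# Equivalent fixed-point iteration: p <- c A p + (1-c)/n * 1
q = np.ones(n) / n
for _ in range(100):
    q = c * A @ q + (1 - c) / n
print(np.allclose(p, q))            # True: both give the steady state
```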

Motivation of CUR (or CMD)
- SVD and PCA transform the data into some abstract space (specified by a set of basis vectors)
- Two problems: interpretability, and loss of sparsity

Interpretability problem
- Each column of the projection matrix U is a linear combination of all dimensions along a certain mode, e.g., U(:,1) = [0.5; -0.5; 0.5; 0.5]
- All the data are projected onto the span of U
- It is hard to interpret the projections
- Recall: PCA projects points onto the "best" axis (the first singular vector v1), with minimum RMS error

The sparsity property
- SVD: A = U Sigma V^T turns a big-but-sparse matrix into dense-but-small factors; SVD/PCA destroys sparsity, since the orthogonal projection densifies the data
- CUR: A ~ C U R keeps C and R big but sparse (they are actual columns and rows of A), with a small U
- [Figure: space ratio vs. accuracy on the Network and DBLP datasets for SVD, CUR, and CMD]
- CMD uses much smaller space to achieve the same accuracy
- CUR limitation: duplicate columns and rows
Reference: Sun et al. Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM'07.
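To make the C-U-R construction concrete before its formal definition on the next slide, a minimal sketch with uniform sampling (the slides also discuss biased sampling with error bounds; the matrix, sample sizes, and RNG seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 150))                              # assumed test matrix

cols = rng.choice(A.shape[1], size=20, replace=False)   # uniform column sample
rows = rng.choice(A.shape[0], size=20, replace=False)   # uniform row sample

C = A[:, cols]                 # actual columns of A (stays sparse if A is)
R = A[rows, :]                 # actual rows of A
X = A[np.ix_(rows, cols)]      # intersection of the sampled rows and columns
U = np.linalg.pinv(X)          # U is the pseudo-inverse of X

rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```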

CUR
- Example-based projection: use actual rows and columns to specify the subspace
- Given a matrix A in R^(m x n), find three matrices C in R^(m x c), U in R^(c x r), R in R^(r x n), such that ||A - C U R|| is small
- U is the pseudo-inverse of X, the intersection of the sampled columns and rows
- Example-based projection vs. orthogonal projection

CUR (cont.)
- Key question: how to select/sample the columns and rows?
  - Uniform sampling
  - Biased sampling
  - CUR with absolute error bound
  - CUR with relative error bound
References:
1. Tutorial: Randomized Algorithms for Matrices and Massive Datasets, SDM'06.
2. Drineas et al. Subspace Sampling and Relative-error Matrix Approximation: Column-Row-Based Methods, ESA 2006.
3. Drineas et al. Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.

Co-clustering
- Let X and Y be discrete random variables taking values in {1, 2, ..., m} and {1, 2, ..., n}
- p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data
- Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
- Key obstacles in clustering contingency tables: high dimensionality, sparsity, noise; need for robust and scalable algorithms
Reference: Dhillon et al. Information-Theoretic Co-clustering, KDD'03.

Co-clustering (problem)
- Given the data matrix and the number of row and column groups k and l, simultaneously cluster the rows of p(X, Y) into k disjoint groups and the columns into l disjoint groups
- Key goal: exploit the duality between row and column clustering to overcome sparsity and noise

Information theory concepts
- Entropy of a random variable X with probability distribution p: H(p) = - sum_x p(x) log p(x)
- The Kullback-Leibler (KL) divergence, or relative entropy, between two probability distributions p and q: KL(p, q) = sum_x p(x) log( p(x) / q(x) )
- Mutual information between random variables X and Y: I(X; Y) = sum_x sum_y p(x, y) log [ p(x, y) / ( p(x) p(y) ) ]
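A minimal NumPy sketch of the three definitions just listed; the small joint distribution is an illustrative assumption:

```python
import numpy as np

P = np.array([[0.20, 0.05],
              [0.05, 0.20],
              [0.10, 0.40]])                 # joint p(x, y), sums to 1

px, py = P.sum(axis=1), P.sum(axis=0)        # marginals p(x) and p(y)

H_X = -np.sum(px * np.log2(px))              # entropy H(p) of X
I_XY = np.sum(P * np.log2(P / np.outer(px, py)))   # mutual information I(X;Y)

def kl(p, q):
    """KL(p, q) = sum_x p(x) log( p(x) / q(x) )."""
    return np.sum(p * np.log2(p / q))
```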

Information-theoretic co-clustering
- View the (scaled) co-occurrence matrix as a joint probability distribution between the row and column random variables:
  p(x, y) = #co-occurrence(x, y) / sum_{x,y} #co-occurrence(x, y)
- We seek a hard clustering of both dimensions such that the loss in mutual information, I(X; Y) - I(Xhat; Yhat), is minimized, given a fixed number of row and column clusters

Loss in mutual information (see the sketch after this slide):
  I(X; Y) - I(Xhat; Yhat) = KL( p(x, y) || q(x, y) ) = H(Xhat, Yhat) + H(X | Xhat) + H(Y | Yhat) - H(X, Y)
- p is the input distribution; q is an approximation to p:
  q(x, y) = p(xhat, yhat) p(x | xhat) p(y | yhat), for x in xhat, y in yhat
- It can be shown that q(x, y) is a maximum-entropy approximation subject to the cluster constraints
- [Figure: a worked example showing p(x, y), p(x | xhat), p(xhat, yhat), p(y | yhat), and the resulting q(x, y)]
- The number of parameters that determine q(x, y) is (m - k) + (kl - 1) + (n - l)

Problem with information-theoretic co-clustering: the number of row and column groups must be specified.
Desiderata:
- Simultaneously discover row and column groups
- Fully automatic: no magic numbers
- Scalable to large graphs

Cross-association: what makes a cross-association good? [Figure: two alternative groupings of the same matrix into row groups x column groups] Why is one better?
Reference: Chakrabarti et al. Fully Automatic Cross-Associations, KDD'04.
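A minimal NumPy sketch of the approximation q(x, y) = p(xhat, yhat) p(x|xhat) p(y|yhat) and the KL loss above; the joint distribution and the cluster assignments are illustrative assumptions:

```python
import numpy as np

P = np.array([[0.20, 0.05, 0.02],
              [0.05, 0.20, 0.03],
              [0.02, 0.03, 0.40]])      # joint p(x, y), sums to 1
rc = np.array([0, 0, 1])                # row-cluster assignment xhat(x)
cc = np.array([0, 0, 1])                # column-cluster assignment yhat(y)

# p(xhat, yhat): aggregate the joint distribution over the clusters
Phat = np.zeros((rc.max() + 1, cc.max() + 1))
np.add.at(Phat, (rc[:, None], cc[None, :]), P)

px, py = P.sum(axis=1), P.sum(axis=0)   # marginals p(x), p(y)
p_xhat = np.bincount(rc, weights=px)    # cluster marginal p(xhat)
p_yhat = np.bincount(cc, weights=py)    # cluster marginal p(yhat)

# q(x, y) = p(xhat, yhat) * p(x | xhat) * p(y | yhat)
Q = Phat[np.ix_(rc, cc)] * np.outer(px / p_xhat[rc], py / p_yhat[cc])

loss = np.sum(P * np.log2(P / Q))       # KL(p || q) = I(X;Y) - I(Xhat;Yhat)
```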

What makes a cross-association good?
- [Figure: row groups x column groups, two alternative groupings] Why is one better? It is simpler, easier to describe, and easier to compress!
- Problem definition: given an encoding scheme, decide on the number of column and row groups k and l, and reorder rows and columns, to achieve the best compression

Main idea: good compression <-> better clustering
- Total encoding cost = sum_i size_i * H(x_i)  (code cost)  +  cost of describing the cross-associations  (description cost)
- Minimize the total cost (# bits) for lossless compression

Algorithm
- Grow the number of groups step by step: k=1,l=2; k=2,l=2; k=2,l=3; k=3,l=3; k=3,l=4; k=4,l=4; k=4,l=5; ... (e.g., up to k=5 row groups and l=5 column groups)
- Code for cross-associations (Matlab) is available
- Variations and extensions: AutoPart [Chakrabarti, PKDD'04]

Cross-Associations vs. co-clustering:

  Information-theoretic co-clustering            | Cross-Associations
  1. For any nonnegative matrix                  | 1. For binary matrices
  2. Lossy compression                           | 2. Lossless compression
  3. Approximates the original matrix, while     | 3. Always provides complete information about the
     trying to minimize KL-divergence            |    matrix, for any number of row and column groups
  4. The number of row and column groups must    | 4. Chosen automatically using the MDL principle
     be given by the user                        |

Roadmap: Matrix tools (cont.)
- Nonnegative matrix factorization: coming up soon, together with nonnegative tensor factorization

Roadmap: Tensor basics
- Tensor basics
- Tucker: Tucker1 (= PCA), Tucker2, Tucker3 (HOSVD)
- PARAFAC

Tensor basics
- A tensor is a multidimensional array: an I x J x K tensor with elements x_ijk
- A 3rd-order tensor: mode 1 has dimension I, mode 2 has dimension J, mode 3 has dimension K
- Fibers: column (mode-1) fibers, row (mode-2) fibers, tube (mode-3) fibers
- Slices: horizontal, lateral, and frontal slices
- Note: the tutorial focuses on 3 dimensions, but everything can be extended to higher dimensionality

Matricize (unfolding)
- Matricize: converting a tensor to a matrix, mapping (i, j, k) -> (i, j')
- X_(n): the mode-n fibers are rearranged to be the columns of a matrix
- Reverse matricize: (i, j') -> (i, j, k)

Tensor mode-n multiplication
- Tensor times matrix: mode-2 multiplication (lateral slices) multiplies each row (mode-2) fiber by B; mode-1 multiplication acts on the frontal slices, mode-3 on the horizontal slices
- Tensor times vector: compute the dot product of a and each column (mode-1) fiber
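A minimal NumPy sketch of matricization and the mode-n product; the fiber-ordering convention and the helper names (unfold, fold, mode_n_product) are assumptions made here, and the later sketches reuse them:

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization X_(n): mode-n fibers become the columns."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    """Reverse matricize: reassemble the tensor from its mode-n unfolding."""
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

def mode_n_product(X, B, n):
    """Y = X x_n B: multiply every mode-n fiber of X by the matrix B."""
    shape = list(X.shape)
    shape[n] = B.shape[0]
    return fold(B @ unfold(X, n), n, shape)

X = np.random.rand(3, 4, 5)         # an I x J x K tensor
B = np.random.rand(7, 4)            # a matrix acting on mode 2 (7 x J)
Y = mode_n_product(X, B, 1)         # result is 3 x 7 x 5
```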

Mode-n product examples
- Tensor times a matrix: multiplying a (location, type, time) tensor along the time mode by a (time-clusters x time) matrix yields a (location, type, time-clusters) tensor
- Tensor times a vector: multiplying along the time mode by a time vector collapses that mode, leaving a (location, type) matrix

Outer, Kronecker, & Khatri-Rao products
- Outer product of vectors: a rank-1 tensor
- Kronecker product: (M x N) Kron (P x Q) -> MP x NQ
- Khatri-Rao product (column-wise Kronecker): (M x R) KR (N x R) -> MN x R
- Observe: the outer product a o b and the Kronecker product a Kron b have the same elements, but one is shaped into a matrix and the other into a vector

Specially structured tensors
- Tucker tensor: an I x J x K tensor written as a core G (R x S x T) multiplied by factors U (I x R), V (J x S), W (K x T); in matrix form, X_(1) = U G_(1) (W Kron V)^T
- Kruskal tensor: a sum of R rank-1 terms, i.e., a Tucker tensor with a superdiagonal R x R x R core; in matrix form, X_(1) = U diag(lambda) (W KR V)^T

Matrix SVD is a Tucker or Kruskal tensor
- Matrix SVD: A = U Sigma V^T = sum_r sigma_r u_r o v_r
- As a Tucker tensor: core Sigma with factors U and V; as a Kruskal tensor: a weighted sum of rank-1 outer products

Tucker decomposition
- An I x J x K tensor ~ core G (R x S x T) times A (I x R), B (J x S), C (K x T)
- Proposed by Tucker (1966); AKA three-mode factor analysis, three-mode PCA, orthogonal array decomposition
- A, B, and C generally assumed to be orthonormal (generally assumed to have full column rank); the core is not diagonal; not unique
- See Kroonenberg & De Leeuw, Psychometrika, 1980 for discussion

Tucker variations
- Tucker2: I x J x K tensor ~ core (R x S x K) times A (I x R) and B (J x S); only two modes are reduced
- Tucker1: I x J x K tensor ~ core (R x J x K) times A (I x R): finding principal components in only mode 1; can be solved via a rank-R matrix SVD (recall the equations for converting a tensor to a matrix)

Higher-order SVD (HO-SVD); see the sketch after this slide
- For each mode, calculate the R leading left singular vectors of the mode-n unfolding
- Not optimal, but often used to initialize the Tucker-ALS algorithm (observe the connection to Tucker1)
- De Lathauwer, De Moor, & Vandewalle, SIMAX, 2000

Solving for Tucker
- Minimize ||X - G x1 A x2 B x3 C|| s.t. A, B, C orthonormal, which is equivalent to maximizing ||X x1 A^T x2 B^T x3 C^T||
- To solve for A (assuming B and C are fixed): calculate the R leading left singular vectors of X_(1) (C Kron B)
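A minimal sketch of HO-SVD, reusing the unfold and mode_n_product helpers from the earlier sketch; the tensor sizes and ranks are illustrative assumptions:

```python
import numpy as np

def hosvd(X, ranks):
    """HO-SVD: per-mode leading left singular vectors, then the core."""
    factors = []
    for n, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        factors.append(U[:, :r])              # R leading left singular vectors
    core = X
    for n, U in enumerate(factors):
        core = mode_n_product(core, U.T, n)   # G = X x1 A^T x2 B^T x3 C^T
    return core, factors

X = np.random.rand(10, 12, 14)
G, (A, B, C) = hosvd(X, (3, 4, 5))            # core G is 3 x 4 x 5
# Reassemble the (approximate) tensor from the Tucker pieces:
X_hat = mode_n_product(mode_n_product(mode_n_product(G, A, 0), B, 1), C, 2)
```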

Alternating least squares (ALS) for Tucker
- Initialize: choose R, S, T; calculate A, B, C via HO-SVD
- Until converged, do:
  A <- R leading left singular vectors of X_(1) (C Kron B)
  B <- S leading left singular vectors of X_(2) (C Kron A)
  C <- T leading left singular vectors of X_(3) (B Kron A)
- Kroonenberg & De Leeuw, Psychometrika, 1980

CANDECOMP/PARAFAC decomposition
- An I x J x K tensor ~ sum of R rank-1 tensors, with weights lambda and factors A (I x R), B (J x R), C (K x R)
- CANDECOMP = canonical decomposition (Carroll & Chang, 1970); PARAFAC = parallel factors (Harshman, 1970)
- The core is diagonal (specified by the vector lambda); the columns of A, B, and C are not orthonormal
- If R is minimal, then R is called the rank of the tensor (Kruskal 1977); the rank can exceed min{I, J, K}

Alternating least squares (ALS) for PARAFAC
- To solve for A, consider X_(1) = A (C KR B)^T
- By the properties of the Khatri-Rao product, we have A = X_(1) (C KR B) (C^T C * B^T B)^+, where * is the elementwise (Hadamard) product and + the pseudo-inverse
- Do the same for B and C, then repeat
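A minimal sketch of CP-ALS, reusing the unfold helper from the earlier sketch; here a least-squares solve stands in for the pseudo-inverse formula, the Khatri-Rao factor order matches this unfold convention (Kolda's Fortran-order unfolding gives the C-before-B order on the slide), and sizes, rank, iteration count, and the omitted column normalization are assumptions:

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product: (M x R), (N x R) -> (MN x R)."""
    return np.einsum('mr,nr->mnr', P, Q).reshape(-1, P.shape[1])

def cp_als(X, R, iters=50):
    """Plain CP-ALS for a 3rd-order tensor (no factor normalization)."""
    I, J, K = X.shape
    rng = np.random.default_rng(0)
    A, B, C = (rng.random((d, R)) for d in (I, J, K))
    for _ in range(iters):
        # Each update solves X_(n) = F_n * (KR of the other two factors)^T.
        A = np.linalg.lstsq(khatri_rao(B, C), unfold(X, 0).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), unfold(X, 1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), unfold(X, 2).T, rcond=None)[0].T
    return A, B, C
```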

Roadmap: Tensor extensions
- Other decompositions
- Nonnegativity
- Missing values
- Incrementalization

Combining Tucker & PARAFAC
- Step 1: choose orthonormal matrices U, V, W to compress the tensor (a Tucker tensor!); typically HO-SVD can be used
- Step 2: run PARAFAC on the smaller tensor
- Step 3: reassemble the result
- Bro and Andersson

3-way DEDICOM (roles x roles x time patterns)
- 2-way DEDICOM introduced by Harshman, 1978; 3-way DEDICOM due to Kiers, 1993
- The idea is to capture asymmetric relationships among different "roles"
- If the third dimension is time, then the diagonal slices capture the participation of each role at each time
- See, e.g., Bader, Harshman, Kolda, SAND report

Computations with tensors: dense tensors
- Dense storage grows fast: a 1000 x 1000 x 1000 tensor of doubles already needs 8 GB, so the largest dense tensor that fits on a laptop has only a few hundred to a thousand elements per mode
- Typically, tensor operations are reduced to matrix operations; this requires permuting and reshaping the tensor
- Example: mode-n tensor-matrix multiply; a mode-1 matrix multiply of an (M x I) matrix with an I x J x K tensor works on the (I x JK) unfolding, gives an (M x JK) result, and is reshaped to M x J x K

Sparse tensors: only store nonzeros
- Store just the nonzeros of a tensor (assume coordinate format): the value of the p-th nonzero plus its 1st, 2nd, and 3rd subscripts
- Example: tensor-vector multiply (in all modes); example: mode-3 tensor-vector multiply

Tucker tensors: store core & factors
- A Tucker tensor stores the core (which can be dense, sparse, or structured) and the factors
- Example: a mode-3 tensor-vector multiply of a Tucker tensor again yields a Tucker tensor

Kruskal tensors: store factors
- Kruskal tensors store the factor matrices U (I x R), V (J x R), W (K x R) and a scaling vector
- Example: the norm reduces to small R x R computations

Nonnegativity
- Non-negative matrix factorization: minimize ||X - A B^T|| subject to the elements of A and B being positive
- Multiplicative update formulas (which do not increase the objective function) built from elementwise multiplies (Hadamard products) and elementwise divides; Lee & Seung, Nature, 1999
- Non-negative 3-way PARAFAC factorization: minimize subject to the elements of A, B, and C being positive; Lee-Seung-like update formulas can be derived for 3D and higher
- M. Mørup, L. K. Hansen, J. Parnas, S. M. Arnfred, Decomposing the time-frequency representation of EEG using non-negative matrix and multi-way factorization, 2006
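A minimal NumPy sketch of the Lee-Seung multiplicative updates for the matrix case, min ||X - A B^T|| with A, B >= 0; the sizes, rank, iteration count, and the small epsilon guard are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 80))                # nonnegative data (illustrative)
R, eps = 10, 1e-9                        # rank and a guard against division by 0
A = rng.random((100, R))
B = rng.random((80, R))

for _ in range(200):
    # Each update is an elementwise multiply and elementwise divide,
    # and does not increase ||X - A B^T||_F.
    A *= (X @ B) / (A @ (B.T @ B) + eps)
    B *= (X.T @ A) / (B @ (A.T @ A) + eps)

print(np.linalg.norm(X - A @ B.T))       # final fit
```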

Handling missing data
- Consider sparse PARAFAC where all the zero entries represent missing data
- Typically, missing values are just set to zero
- There are more sophisticated approaches for handling missing values: weighted approximation, and data imputation to estimate the missing values (see the sketch after this slide)
- See, e.g., Kiers, Psychometrika, 1997 and Srebro & Jaakkola, ICML 2003

Weighted least squares
- Introduce a weight tensor W that separates known values from estimates of unknowns, and weight the least-squares problem so that the missing elements are ignored
- But this problem is often too hard to solve directly!

Missing value imputation
- Use the current estimate (a Kruskal tensor) to fill in the missing values: the tensor for the next iteration of the algorithm is (known values) + (estimates of the unknowns); the known part stays sparse!
- The challenge is finding a good initial estimate

Tensor Toolbox for MATLAB
- Six object-oriented tensor classes; working with tensors is easy
- Most comprehensive set of kernel operations in any language (e.g., arithmetic, logical, multiplication operations)
- Sparse tensors are unique: speed-ups of two orders of magnitude for smaller problems, and larger problems than ever before
- Free for research or evaluation purposes; 297 unique registered users from all over the world (as of January 17, 2006)
- Bader & Kolda, ACM TOMS 2006 & SAND report
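A minimal sketch of the imputation loop described above, shown for the matrix case with an SVD refit standing in for the PARAFAC/Kruskal model of the slides; the data, mask, rank, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 40))                 # "true" data (illustrative)
W = rng.random((50, 40)) > 0.3           # weight: True where the value is known
X_est = np.where(W, X, 0.0)              # initialize the unknowns (e.g., zeros)

for _ in range(20):
    # Refit a low-rank model to the current estimate...
    U, s, Vt = np.linalg.svd(X_est, full_matrices=False)
    low_rank = U[:, :5] @ np.diag(s[:5]) @ Vt[:5, :]
    # ...then keep the known values and impute the rest for the next pass.
    X_est = np.where(W, X, low_rank)

fit = np.linalg.norm((X - low_rank)[W])  # model fit on the known entries
```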

Incrementalization: incremental tensor decomposition
- Dynamic data model: tensor streams
- Dynamic Tensor Decomposition (DTA)
- Streaming Tensor Decomposition (STA)
- Window-based Tensor Decomposition (WTA)

Dynamic tensor stream
- Streams come with structure: (time, source, destination, port), (time, author, keyword)
- How to summarize tensor streams effectively and incrementally?

Dynamic data model: tensor streams -- a sequence of Mth-order tensors, where n increases over time
- 1st order: multiple streams (sensors x time)
- 2nd order: time-evolving graphs (author x keyword over time)
- 3rd order: 3D arrays (source x destination x port over time)

Incremental tensor decomposition: as each new tensor arrives, update the old cores and projection matrices U instead of recomputing from scratch.

1st-order DTA -- problem
- Given x_1, ..., x_n, where each x_i is in R^N, find U in R^(N x R) such that the error e is small
- y = U^T x projects each N-dimensional point (e.g., N sensors, indoor and outdoor) down to R dimensions; note that Y = X U

1st-order Dynamic Tensor Analysis (see the sketch after this slide)
- Input: new data vector x in R^N, old variance matrix C in R^(N x N)
- Output: new projection matrix U in R^(N x R)
- Algorithm: 1. update the variance matrix: C_new <- x x^T + C; 2. diagonalize U Lambda U^T = C_new; 3. determine the rank R and return U

Mth-order DTA
- For each mode d: reconstruct the variance matrix C_d <- U_d S_d U_d^T; matricize the new tensor and update C_d <- C_d + X_(d)^T X_(d); diagonalize C_d to obtain the updated U_d and S_d
- Note: the diagonalization has to be done for every new tensor!

Mth-order DTA complexity
- Storage: O(prod_i N_i), i.e., the size of an input tensor at a single timestamp
- Computation: sum_i N_i^3 (or N_i^2) for diagonalizing the C_d, plus the matrix multiplications X_(d)^T X_(d)
- For low-order tensors (< 3 modes), diagonalization is the main cost; for high-order tensors, matrix multiplication is the main cost

1st-order Streaming Tensor Analysis (STA)
- Adjust U smoothly as new data arrive, without diagonalization [VLDB'05]
- For each new point x: project onto the current line; estimate the error; rotate the line in the direction of the error and in proportion to its magnitude
- For each new point x and for i = 1, ..., k:
  y_i := U_i^T x   (projection onto U_i)
  d_i <- lambda d_i + y_i^2   (energy ~ i-th eigenvalue)
  e_i := x - y_i U_i   (error)
  U_i <- U_i + (1/d_i) y_i e_i   (update estimate)
  x <- x - y_i U_i   (repeat with the remainder)

Mth-order STA
- Run 1st-order STA along each mode
- Complexity: storage O(prod_i N_i); computation sum_i R_i prod_i N_i, which is smaller than DTA
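A minimal NumPy sketch of the 1st-order DTA update above; the dimensions, rank, synthetic stream, and optional forgetting factor are illustrative assumptions:

```python
import numpy as np

N, R = 10, 3
C = np.zeros((N, N))                          # running variance matrix

def dta_update(C, x, R, forget=1.0):
    """One DTA step: fold x into C, rediagonalize, keep top-R eigenvectors."""
    C = forget * C + np.outer(x, x)           # 1. C_new <- x x^T + C
    vals, vecs = np.linalg.eigh(C)            # 2. diagonalize U Lambda U^T = C_new
    U = vecs[:, np.argsort(vals)[::-1][:R]]   # 3. top-R eigenvectors as U
    return C, U

for _ in range(100):                          # a stream of data vectors
    x = np.random.rand(N)
    C, U = dta_update(C, x, R)

y = U.T @ np.random.rand(N)                   # project a new point: y = U^T x
```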

Moving Window scheme (MW) for window-based tensor analysis
- Tensor streams: ..., D_{n-w}, ..., D(n-1, W), D(n, W) over time
- Update the variance matrix C_d incrementally: subtract the contribution of the tensor leaving the window and add the new tensor D_n
- Diagonalize C_d to find U_d
- A good and efficient initialization

Roadmap: case studies.

P1: Environmental sensor monitoring
- [Figure: value-vs-time plots for temperature, light, humidity, voltage]

P2: Sensor monitoring
- 1st factor (time, location, type) consists of the main trends: daily periodicity in time; uniform across all locations; temperature, light, and voltage are positively correlated, while all three are negatively correlated with humidity
- 2nd factor captures an atypical trend: uniform across all time; concentrated on 3 locations; mainly due to voltage. Interpretation: two sensors have low battery, and the other one has high battery.

P3: Social network analysis -- multiway latent semantic indexing (LSI)
- Monitor the change of the community structure over time (authors x DB/DM keywords x years, 1990-2004)
- Pattern matrices U_K (keywords) and U_A (authors); query examples: "Philip Yu", "Michael Stonebreaker"

P3 (cont.): sample groups found

  Group | Authors                                  | Keywords
  DB    | michael carey, michael stonebreaker,     | queri, parallel, optimization,
        | h. jagadish, hector garcia-molina        | concurr, objectorient
  DB    | surajit chaudhuri, mitch cherniack,      | distribut, systems, view, storage,
        | michael stonebreaker, ugur cetintemel    | servic, process, cache
  DM    | jiawei han, jian pei, philip s. yu,      | streams, pattern, support, cluster,
        | jianyong wang, charu c. aggarwal         | index, gener, queri

- Two groups are correctly identified: databases and data mining
- People and concepts drift over time

P4: Network anomaly detection
- [Figure: source x destination matrices for abnormal and normal traffic, and the reconstruction error over time (hours)]
- The reconstruction error gives an indication of anomalies
- The prominent difference between normal and abnormal periods is mainly due to unusual scanning activity (confirmed by the campus admin)

P5: Web graph mining
- How to order the importance of web pages? Kleinberg's algorithm (HITS); PageRank; tensor extension of HITS (TOPHITS)

Kleinberg's hubs and authorities (the HITS method)
- Take the sparse (from x to) adjacency matrix and its SVD: each pair of singular vectors gives the hub scores and authority scores for one topic (1st topic, 2nd topic, ...)
- Kleinberg, JACM, 1999

HITS authorities on sample data
- We started our crawl from a single seed page and crawled 4700 pages, resulting in 560 cross-linked hosts
- [Figure: top authorities per principal factor, e.g., www-128.ibm.com and developers.sun.com; www2.lehigh.edu, lewisweb.cc.lehigh.edu, and fp1.cc.lehigh.edu; java.sun.com, docs.sun.com, blogs.sun.com, and sunsolve.sun.com; travel.state.gov; mathpost.asu.edu, math.la.asu.edu, and archives.math.utk.edu; news.com.com]

Three-dimensional view of the web
- Add a term mode: a (from-host) x (to-host) x term tensor; observe that this tensor is very sparse!
- Kolda, Bader, Kenny, ICDM'05

Topical HITS (TOPHITS)
- Main idea: extend the idea behind the HITS model to incorporate term (i.e., topical) information
- Each factor now carries term scores, hub scores, and authority scores for its topic

Topical HITS (TOPHITS), cont.
- Each group of (term scores, hub scores, authority scores) describes one topic of the (from-host) x (to-host) x term tensor

TOPHITS terms & authorities on sample data
- TOPHITS uses 3D analysis (PARAFAC on the tensor) to find the dominant groupings of web pages and terms
- Each link is weighted: w_k = # unique links using term k
- [Figure: top terms and authorities per principal factor, e.g., JAVA / SUN / PLATFORM / SOLARIS / DEVELOPER / EDITION / DOWNLOAD with java.sun.com, developers.sun.com, docs.sun.com, see.sun.com; FACULTY / SEARCH / NEWS / LIBRARIES / COMPUTING / LEHIGH with www2.lehigh.edu; IBM / SERVICES / WEBSPHERE / SOFTWARE / DEVELOPERWORKS / LINUX with www-128.ibm.com; PRESIDENT / FEDERAL / CITIZEN / BUSH / WELCOME / HOUSE / BUDGET / PRESIDENTS / OFFICE with travel.state.gov; OPTIMIZATION / SOFTWARE / DECISION / NEOS / TREE / GUIDE / SEARCH / ENGINE / CONTROL / ILOG with plato.la.asu.edu and www-fp.mcs.anl.gov; WEATHER / OFFICE / CENTER / NWS / SEVERE / FIRE / CLIMATE with lwf.ncdc.noaa.gov and aviationweather.gov; TAX / TAXES / CHILD / RETIREMENT / BENEFITS / STATE / INCOME / SERVICE / REVENUE / CREDIT]

Conclusion
- Real data are often in high dimensions with multiple aspects (modes)
- Matrices and tensors provide elegant theory and algorithms for such data
- However, many problems are still open: skew distributions, anomaly detection, streaming algorithms, distributed/parallel algorithms, efficient out-of-core processing

Thank you!
- Christos Faloutsos
- Tamara Kolda: csmr.ca.sandia.gov/~tgkolda
- Jimeng Sun

Large Graph Mining: Power Tools and a Practitioner s guide

Large Graph Mining: Power Tools and a Practitioner s guide Large Graph Mining: Power Tools and a Practitioner s guide Task 5: Graphs over time & tensors Faloutsos, Miller, Tsourakakis CMU KDD '9 Faloutsos, Miller, Tsourakakis P5-1 Outline Introduction Motivation

More information

Faloutsos, Tong ICDE, 2009

Faloutsos, Tong ICDE, 2009 Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity

More information

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos Must-read Material 15-826: Multimedia Databases and Data Mining Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. Technical Report SAND2007-6702, Sandia National Laboratories,

More information

Higher-Order Web Link Analysis Using Multilinear Algebra

Higher-Order Web Link Analysis Using Multilinear Algebra Higher-Order Web Link Analysis Using Multilinear Algebra Tamara G. Kolda, Brett W. Bader, and Joseph P. Kenny Sandia National Laboratories Livermore, CA and Albuquerque, NM {tgkolda,bwbader,jpkenny}@sandia.gov

More information

Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams

Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams Jimeng Sun Spiros Papadimitriou Philip S. Yu Carnegie Mellon University Pittsburgh, PA, USA IBM T.J. Watson Research Center Hawthorne,

More information

Tensor Analysis. Topics in Data Mining Fall Bruno Ribeiro

Tensor Analysis. Topics in Data Mining Fall Bruno Ribeiro Tensor Analysis Topics in Data Mining Fall 2015 Bruno Ribeiro Tensor Basics But First 2 Mining Matrices 3 Singular Value Decomposition (SVD) } X(i,j) = value of user i for property j i 2 j 5 X(Alice, cholesterol)

More information

Incremental Pattern Discovery on Streams, Graphs and Tensors

Incremental Pattern Discovery on Streams, Graphs and Tensors Incremental Pattern Discovery on Streams, Graphs and Tensors Jimeng Sun Thesis Committee: Christos Faloutsos Tom Mitchell David Steier, External member Philip S. Yu, External member Hui Zhang Abstract

More information

CS60021: Scalable Data Mining. Dimensionality Reduction

CS60021: Scalable Data Mining. Dimensionality Reduction J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Dimensionality Reduction Sourangshu Bhattacharya Assumption: Data lies on or near a

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2018/19 Part 6: Some Other Stuff PD Dr.

More information

Scalable Tensor Factorizations with Incomplete Data

Scalable Tensor Factorizations with Incomplete Data Scalable Tensor Factorizations with Incomplete Data Tamara G. Kolda & Daniel M. Dunlavy Sandia National Labs Evrim Acar Information Technologies Institute, TUBITAK-UEKAE, Turkey Morten Mørup Technical

More information

Parallel Tensor Compression for Large-Scale Scientific Data

Parallel Tensor Compression for Large-Scale Scientific Data Parallel Tensor Compression for Large-Scale Scientific Data Woody Austin, Grey Ballard, Tamara G. Kolda April 14, 2016 SIAM Conference on Parallel Processing for Scientific Computing MS 44/52: Parallel

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal

More information

Introduction to Tensors. 8 May 2014

Introduction to Tensors. 8 May 2014 Introduction to Tensors 8 May 2014 Introduction to Tensors What is a tensor? Basic Operations CP Decompositions and Tensor Rank Matricization and Computing the CP Dear Tullio,! I admire the elegance of

More information

Two heads better than one: Pattern Discovery in Time-evolving Multi-Aspect Data

Two heads better than one: Pattern Discovery in Time-evolving Multi-Aspect Data Two heads better than one: Pattern Discovery in Time-evolving Multi-Aspect Data Jimeng Sun 1, Charalampos E. Tsourakakis 2, Evan Hoke 4, Christos Faloutsos 2, and Tina Eliassi-Rad 3 1 IBM T.J. Watson Research

More information

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya CS 375 Advanced Machine Learning Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya Outline SVD and LSI Kleinberg s Algorithm PageRank Algorithm Vector Space Model Vector space model represents

More information

The multiple-vector tensor-vector product

The multiple-vector tensor-vector product I TD MTVP C KU Leuven August 29, 2013 In collaboration with: N Vanbaelen, K Meerbergen, and R Vandebril Overview I TD MTVP C 1 Introduction Inspiring example Notation 2 Tensor decompositions The CP decomposition

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

Fitting a Tensor Decomposition is a Nonlinear Optimization Problem

Fitting a Tensor Decomposition is a Nonlinear Optimization Problem Fitting a Tensor Decomposition is a Nonlinear Optimization Problem Evrim Acar, Daniel M. Dunlavy, and Tamara G. Kolda* Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia

More information

1 Searching the World Wide Web

1 Searching the World Wide Web Hubs and Authorities in a Hyperlinked Environment 1 Searching the World Wide Web Because diverse users each modify the link structure of the WWW within a relatively small scope by creating web-pages on

More information

Fundamentals of Multilinear Subspace Learning

Fundamentals of Multilinear Subspace Learning Chapter 3 Fundamentals of Multilinear Subspace Learning The previous chapter covered background materials on linear subspace learning. From this chapter on, we shall proceed to multiple dimensions with

More information

Lecture 12: Link Analysis for Web Retrieval

Lecture 12: Link Analysis for Web Retrieval Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #21: Dimensionality Reduction Seoul National University 1 In This Lecture Understand the motivation and applications of dimensionality reduction Learn the definition

More information

Tutorial on MATLAB for tensors and the Tucker decomposition

Tutorial on MATLAB for tensors and the Tucker decomposition Tutorial on MATLAB for tensors and the Tucker decomposition Tamara G. Kolda and Brett W. Bader Workshop on Tensor Decomposition and Applications CIRM, Luminy, Marseille, France August 29, 2005 Sandia is

More information

The Canonical Tensor Decomposition and Its Applications to Social Network Analysis

The Canonical Tensor Decomposition and Its Applications to Social Network Analysis The Canonical Tensor Decomposition and Its Applications to Social Network Analysis Evrim Acar, Tamara G. Kolda and Daniel M. Dunlavy Sandia National Labs Sandia is a multiprogram laboratory operated by

More information

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to ge

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to ge Graph Mining: Laws, Generators and Tools Christos Faloutsos CMU Thank you! Sharad Mehrotra UCI 2007 C. Faloutsos 2 Outline Problem definition / Motivation Static & dynamic laws; generators Tools: CenterPiece

More information

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 Web pages are not equally important www.joe-schmoe.com

More information

Two heads better than one: pattern discovery in time-evolving multi-aspect data

Two heads better than one: pattern discovery in time-evolving multi-aspect data Data Min Knowl Disc (2008) 17:111 128 DOI 10.1007/s10618-008-0112-3 Two heads better than one: pattern discovery in time-evolving multi-aspect data Jimeng Sun Charalampos E. Tsourakakis Evan Hoke Christos

More information

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Link Analysis Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 The Web as a Directed Graph Page A Anchor hyperlink Page B Assumption 1: A hyperlink between pages

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices

More information

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden Index 1-norm, 15 matrix, 17 vector, 15 2-norm, 15, 59 matrix, 17 vector, 15 3-mode array, 91 absolute error, 15 adjacency matrix, 158 Aitken extrapolation, 157 algebra, multi-linear, 91 all-orthogonality,

More information

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia Network Feature s Decompositions for Machine Learning 1 1 Department of Computer Science University of British Columbia UBC Machine Learning Group, June 15 2016 1/30 Contact information Network Feature

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Tensor Decompositions and Applications

Tensor Decompositions and Applications Tamara G. Kolda and Brett W. Bader Part I September 22, 2015 What is tensor? A N-th order tensor is an element of the tensor product of N vector spaces, each of which has its own coordinate system. a =

More information

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS DATA MINING LECTURE 3 Link Analysis Ranking PageRank -- Random walks HITS How to organize the web First try: Manually curated Web Directories How to organize the web Second try: Web Search Information

More information

Numerical Methods I Singular Value Decomposition

Numerical Methods I Singular Value Decomposition Numerical Methods I Singular Value Decomposition Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 October 9th, 2014 A. Donev (Courant Institute)

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality

More information

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa Introduction to Search Engine Technology Introduction to Link Structure Analysis Ronny Lempel Yahoo Labs, Haifa Outline Anchor-text indexing Mathematical Background Motivation for link structure analysis

More information

From Matrix to Tensor. Charles F. Van Loan

From Matrix to Tensor. Charles F. Van Loan From Matrix to Tensor Charles F. Van Loan Department of Computer Science January 28, 2016 From Matrix to Tensor From Tensor To Matrix 1 / 68 What is a Tensor? Instead of just A(i, j) it s A(i, j, k) or

More information

A randomized block sampling approach to the canonical polyadic decomposition of large-scale tensors

A randomized block sampling approach to the canonical polyadic decomposition of large-scale tensors A randomized block sampling approach to the canonical polyadic decomposition of large-scale tensors Nico Vervliet Joint work with Lieven De Lathauwer SIAM AN17, July 13, 2017 2 Classification of hazardous

More information

ENGG5781 Matrix Analysis and Computations Lecture 10: Non-Negative Matrix Factorization and Tensor Decomposition

ENGG5781 Matrix Analysis and Computations Lecture 10: Non-Negative Matrix Factorization and Tensor Decomposition ENGG5781 Matrix Analysis and Computations Lecture 10: Non-Negative Matrix Factorization and Tensor Decomposition Wing-Kin (Ken) Ma 2017 2018 Term 2 Department of Electronic Engineering The Chinese University

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Multi-Way Compressed Sensing for Big Tensor Data

Multi-Way Compressed Sensing for Big Tensor Data Multi-Way Compressed Sensing for Big Tensor Data Nikos Sidiropoulos Dept. ECE University of Minnesota MIIS, July 1, 2013 Nikos Sidiropoulos Dept. ECE University of Minnesota ()Multi-Way Compressed Sensing

More information

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Online Tensor Factorization for. Feature Selection in EEG

Online Tensor Factorization for. Feature Selection in EEG Online Tensor Factorization for Feature Selection in EEG Alric Althoff Honors Thesis, Department of Cognitive Science, University of California - San Diego Supervised by Dr. Virginia de Sa Abstract Tensor

More information

PROBABILISTIC LATENT SEMANTIC ANALYSIS

PROBABILISTIC LATENT SEMANTIC ANALYSIS PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli Miettinen Jul 4, 2013 Link analysis The web as a directed graph Set of web pages with associated textual content Hyperlinks between webpages

More information

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

/16/$ IEEE 1728

/16/$ IEEE 1728 Extension of the Semi-Algebraic Framework for Approximate CP Decompositions via Simultaneous Matrix Diagonalization to the Efficient Calculation of Coupled CP Decompositions Kristina Naskovska and Martin

More information

Link Analysis Ranking

Link Analysis Ranking Link Analysis Ranking How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would you do it? Naïve ranking of query results Given query

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 18: Latent Semantic Indexing Hinrich Schütze Center for Information and Language Processing, University of Munich 2013-07-10 1/43

More information

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,

More information

arxiv: v1 [cs.lg] 18 Nov 2018

arxiv: v1 [cs.lg] 18 Nov 2018 THE CORE CONSISTENCY OF A COMPRESSED TENSOR Georgios Tsitsikas, Evangelos E. Papalexakis Dept. of Computer Science and Engineering University of California, Riverside arxiv:1811.7428v1 [cs.lg] 18 Nov 18

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor

More information

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) Chapter 5 The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 5.1 Basics of SVD 5.1.1 Review of Key Concepts We review some key definitions and results about matrices that will

More information

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors Michael K. Ng Centre for Mathematical Imaging and Vision and Department of Mathematics

More information

CVPR A New Tensor Algebra - Tutorial. July 26, 2017

CVPR A New Tensor Algebra - Tutorial. July 26, 2017 CVPR 2017 A New Tensor Algebra - Tutorial Lior Horesh lhoresh@us.ibm.com Misha Kilmer misha.kilmer@tufts.edu July 26, 2017 Outline Motivation Background and notation New t-product and associated algebraic

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu November 16, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

1. Ignoring case, extract all unique words from the entire set of documents.

1. Ignoring case, extract all unique words from the entire set of documents. CS 378 Introduction to Data Mining Spring 29 Lecture 2 Lecturer: Inderjit Dhillon Date: Jan. 27th, 29 Keywords: Vector space model, Latent Semantic Indexing(LSI), SVD 1 Vector Space Model The basic idea

More information

Image Registration Lecture 2: Vectors and Matrices

Image Registration Lecture 2: Vectors and Matrices Image Registration Lecture 2: Vectors and Matrices Prof. Charlene Tsai Lecture Overview Vectors Matrices Basics Orthogonal matrices Singular Value Decomposition (SVD) 2 1 Preliminary Comments Some of this

More information

Using Matrix Decompositions in Formal Concept Analysis

Using Matrix Decompositions in Formal Concept Analysis Using Matrix Decompositions in Formal Concept Analysis Vaclav Snasel 1, Petr Gajdos 1, Hussam M. Dahwa Abdulla 1, Martin Polovincak 1 1 Dept. of Computer Science, Faculty of Electrical Engineering and

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize/navigate it? First try: Human curated Web directories Yahoo, DMOZ, LookSmart

More information

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information

More information

Talk 2: Graph Mining Tools - SVD, ranking, proximity. Christos Faloutsos CMU

Talk 2: Graph Mining Tools - SVD, ranking, proximity. Christos Faloutsos CMU Talk 2: Graph Mining Tools - SVD, ranking, proximity Christos Faloutsos CMU Outline Introduction Motivation Task 1: Node importance Task 2: Recommendations Task 3: Connection sub-graphs Conclusions Lipari

More information

Lecture: Face Recognition and Feature Reduction

Lecture: Face Recognition and Feature Reduction Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab Lecture 11-1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed

More information

Wafer Pattern Recognition Using Tucker Decomposition

Wafer Pattern Recognition Using Tucker Decomposition Wafer Pattern Recognition Using Tucker Decomposition Ahmed Wahba, Li-C. Wang, Zheng Zhang UC Santa Barbara Nik Sumikawa NXP Semiconductors Abstract In production test data analytics, it is often that an

More information

Third-Order Tensor Decompositions and Their Application in Quantum Chemistry

Third-Order Tensor Decompositions and Their Application in Quantum Chemistry Third-Order Tensor Decompositions and Their Application in Quantum Chemistry Tyler Ueltschi University of Puget SoundTacoma, Washington, USA tueltschi@pugetsound.edu April 14, 2014 1 Introduction A tensor

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

How does Google rank webpages?

How does Google rank webpages? Linear Algebra Spring 016 How does Google rank webpages? Dept. of Internet and Multimedia Eng. Konkuk University leehw@konkuk.ac.kr 1 Background on search engines Outline HITS algorithm (Jon Kleinberg)

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

Quick Introduction to Nonnegative Matrix Factorization

Quick Introduction to Nonnegative Matrix Factorization Quick Introduction to Nonnegative Matrix Factorization Norm Matloff University of California at Davis 1 The Goal Given an u v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices

More information

Sparseness Constraints on Nonnegative Tensor Decomposition

Sparseness Constraints on Nonnegative Tensor Decomposition Sparseness Constraints on Nonnegative Tensor Decomposition Na Li nali@clarksonedu Carmeliza Navasca cnavasca@clarksonedu Department of Mathematics Clarkson University Potsdam, New York 3699, USA Department

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Kronecker Product Approximation with Multiple Factor Matrices via the Tensor Product Algorithm

Kronecker Product Approximation with Multiple Factor Matrices via the Tensor Product Algorithm Kronecker Product Approximation with Multiple actor Matrices via the Tensor Product Algorithm King Keung Wu, Yeung Yam, Helen Meng and Mehran Mesbahi Department of Mechanical and Automation Engineering,

More information

CS 340 Lec. 6: Linear Dimensionality Reduction

CS 340 Lec. 6: Linear Dimensionality Reduction CS 340 Lec. 6: Linear Dimensionality Reduction AD January 2011 AD () January 2011 1 / 46 Linear Dimensionality Reduction Introduction & Motivation Brief Review of Linear Algebra Principal Component Analysis

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Big Tensor Data Reduction

Big Tensor Data Reduction Big Tensor Data Reduction Nikos Sidiropoulos Dept. ECE University of Minnesota NSF/ECCS Big Data, 3/21/2013 Nikos Sidiropoulos Dept. ECE University of Minnesota () Big Tensor Data Reduction NSF/ECCS Big

More information

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

Learning representations

Learning representations Learning representations Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 4/11/2016 General problem For a dataset of n signals X := [ x 1 x

More information

Postgraduate Course Signal Processing for Big Data (MSc)

Postgraduate Course Signal Processing for Big Data (MSc) Postgraduate Course Signal Processing for Big Data (MSc) Jesús Gustavo Cuevas del Río E-mail: gustavo.cuevas@upm.es Work Phone: +34 91 549 57 00 Ext: 4039 Course Description Instructor Information Course

More information

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY OUTLINE Why We Need Matrix Decomposition SVD (Singular Value Decomposition) NMF (Nonnegative

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu What is the structure of the Web? How is it organized? 2/7/2011 Jure Leskovec, Stanford C246: Mining Massive

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation Graphs and Networks Page 1 Lecture 2, Ranking 1 Tuesday, September 12, 2006 1:14 PM I. II. I. How search engines work: a. Crawl the web, creating a database b. Answer query somehow, e.g. grep. (ex. Funk

More information

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page IV.3 HITS Hyperlinked-Induced Topic Search (HITS) identifies authorities as good content sources (~high indegree) hubs as good link sources (~high outdegree) HITS [Kleinberg 99] considers a web page a

More information

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University. Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org #1: C4.5 Decision Tree - Classification (61 votes) #2: K-Means - Clustering

More information