Data Mining and Matrices 05 Semi-Discrete Decomposition Rainer Gemulla, Pauli Miettinen May 16, 2013

Outline
1. Hunting the Bump
2. Semi-Discrete Decomposition
3. The Algorithm
4. Applications: SDD alone; SVD + SDD
5. Wrap-Up

An example data set. [Figure: heatmap of the data matrix; axes show row and column indices 100-700.]

An example data set. [Figure: heatmap of the data after permuting rows and columns; values range from 0 to 3.5.]

An example data set. [Figure: the data in a 3D view.] Can we find the bumps in the picture automatically (from the unpermuted data)?

What is a bump?

A submatrix of a matrix $A \in \mathbb{R}^{m \times n}$ contains some rows of $A$ and some columns of those rows. Let $I \subseteq \{1, 2, \ldots, m\}$ hold the row indices and $J \subseteq \{1, 2, \ldots, n\}$ hold the column indices of the submatrix. If $x \in \{0,1\}^m$ has $x_i = 1$ iff $i \in I$ and $y \in \{0,1\}^n$ has $y_j = 1$ iff $j \in J$, then $xy^T \in \{0,1\}^{m \times n}$ has $(xy^T)_{ij} = 1$ iff $a_{ij}$ is in the submatrix. $A \circ xy^T$ has the values of the submatrix and zeros elsewhere, where $(A \circ B)_{ij} = a_{ij} b_{ij}$ is the Hadamard (element-wise) matrix product. The submatrix is uniform if all (or most) of its values are (approximately) the same. An exactly uniform submatrix with value $\delta$ can be written as $\delta xy^T$: a bump.

Example:
$$A = \begin{pmatrix} 3 & 1 & 3 \\ 2 & 3 & 1 \\ 3 & 2 & 3 \end{pmatrix}, \quad I = \{1,3\},\ J = \{1,3\}, \quad x = y = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \quad A \circ xy^T = \begin{pmatrix} 3 & 0 & 3 \\ 0 & 0 & 0 \\ 3 & 0 & 3 \end{pmatrix}$$
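A minimal numpy sketch of these definitions (illustrative only; the variable names are my own): it builds the indicator vectors $x$ and $y$ from the index sets $I$ and $J$ and checks that the Hadamard product $A \circ xy^T$ picks out exactly the uniform submatrix with value 3.

```python
import numpy as np

# The example matrix from the slide and the index sets of the bump.
A = np.array([[3, 1, 3],
              [2, 3, 1],
              [3, 2, 3]], dtype=float)
I = [0, 2]   # row indices (0-based here; {1, 3} on the slide)
J = [0, 2]   # column indices

# Indicator (characteristic) vectors x and y.
x = np.zeros(A.shape[0]); x[I] = 1
y = np.zeros(A.shape[1]); y[J] = 1

# A o xy^T keeps the submatrix values and zeros out everything else.
bump_mask = np.outer(x, y)            # (xy^T)_{ij} = 1 iff a_ij is in the submatrix
print(A * bump_mask)                  # Hadamard product: [[3,0,3],[0,0,0],[3,0,3]]

# The bump itself is delta * xy^T with delta = 3.
delta = 3.0
print(np.allclose(A * bump_mask, delta * np.outer(x, y)))  # True: the submatrix is uniform
```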

The next bump and negative values

Assume we know how to find the largest bump of a matrix. To find another bump, we can subtract the found bump from the matrix and look for the largest bump of the residual matrix. But after the subtraction we might have negative values in the matrix. We can generalize the uniform submatrices to require uniformity only in magnitude: allow the characteristic vectors $x$ and $y$ to take values from $\{-1, 0, 1\}$. If $x = (-1, 0, 1)^T$ and $y = (1, 0, -1)^T$, then
$$\delta xy^T = \begin{pmatrix} -\delta & 0 & \delta \\ 0 & 0 & 0 \\ \delta & 0 & -\delta \end{pmatrix}.$$
This allows us to define bumps in matrices with negative values.
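A small numpy sketch (my own illustration, not from the slides) of the residual step: subtract the first bump found above, then represent a magnitude-uniform bump with ternary characteristic vectors.

```python
import numpy as np

A = np.array([[3, 1, 3],
              [2, 3, 1],
              [3, 2, 3]], dtype=float)

# First bump: delta = 3 on rows {1,3} and columns {1,3} (0-based: {0,2}).
x1 = np.array([1.0, 0.0, 1.0]); y1 = np.array([1.0, 0.0, 1.0]); d1 = 3.0

# Subtract the bump to get the residual matrix.
R = A - d1 * np.outer(x1, y1)
print(R)   # [[0,1,0],[2,3,1],[0,2,0]]: here the bump is exact, so its entries become 0
# If the bump were only approximately uniform, subtracting d1*x1*y1^T could leave
# negative residuals, which is why the characteristic vectors are extended to {-1,0,1}.

# With ternary vectors, a bump only needs uniform *magnitude*.
x = np.array([-1.0, 0.0, 1.0]); y = np.array([1.0, 0.0, -1.0]); delta = 0.5
print(delta * np.outer(x, y))   # [[-0.5, 0, 0.5], [0, 0, 0], [0.5, 0, -0.5]]
```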

The definition

Semi-Discrete Decomposition. Given a matrix $A \in \mathbb{R}^{m \times n}$, the semi-discrete decomposition (SDD) of $A$ of dimension $k$ is
$$A \approx X_k D_k Y_k^T,$$
where $X_k \in \{-1, 0, 1\}^{m \times k}$, $Y_k \in \{-1, 0, 1\}^{n \times k}$, and $D_k \in \mathbb{R}_+^{k \times k}$ is a diagonal matrix.
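A short sketch (hypothetical data, my own helper names) of what the factors look like: $X_k$ and $Y_k$ are ternary, $D_k$ is diagonal with non-negative entries, and $X_k D_k Y_k^T$ is a sum of $k$ rank-1 bumps.

```python
import numpy as np

m, n, k = 6, 5, 2

# A hypothetical 2-dimensional SDD: ternary factor matrices and non-negative weights.
X = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, -1]])   # m x k, entries in {-1,0,1}
Y = np.array([[1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])            # n x k
d = np.array([2.0, 0.5])                                          # diagonal of D_k

A_approx = X @ np.diag(d) @ Y.T          # the SDD approximation of some m x n matrix A
print(A_approx.shape)                    # (6, 5)

# Equivalently, the SDD is a sum of k rank-1 "bumps" d_i * x_i * y_i^T:
layers = sum(d[i] * np.outer(X[:, i], Y[:, i]) for i in range(k))
print(np.allclose(A_approx, layers))     # True
```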

Example. [Figure: the data and the first component $\sigma_1 u_1 v_1^T$ from the SVD.]

Example. [Figure: the data and the second component $\sigma_2 u_2 v_2^T$ from the SVD.] The SVD cannot find the bumps.

Example. [Figure: the data and the first bump $d_1 x_1 y_1^T$ from the SDD.]

Example. [Figure: the data and the second bump $d_2 x_2 y_2^T$ from the SDD.]

Example. [Figure: the data and the third bump $d_3 x_3 y_3^T$ from the SDD.]

Example. [Figure: the data and the fourth bump $d_4 x_4 y_4^T$ from the SDD.]

Example. [Figure: the data and the fifth bump $d_5 x_5 y_5^T$ from the SDD.]

Example. [Figure: the data and the 5-dimensional SDD approximation $X_5 D_5 Y_5^T$.]

Properties of SDD
- The columns of $X_k$ and $Y_k$ do not need to be linearly independent; the same column can even be repeated multiple times.
- The dimension $k$ might need to be large for an accurate approximation (compared to the SVD): $k = \min\{m, n\}$ is not necessarily enough for an exact SDD, but $k = mn$ is always enough.
- The first factors don't necessarily explain much about the matrix.
- SDD factors are local: they only affect a certain submatrix, typically not every element, whereas SVD factors typically change every value.
- Storing a $k$-dimensional SDD takes less space than storing a rank-$k$ truncated SVD, because $X_k$ and $Y_k$ are ternary and often sparse.
- For every rank-1 layer of an SDD, all non-zero values in the layer have the same magnitude ($d_{ii}$ for layer $i$).
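A back-of-the-envelope comparison (my own numbers, purely illustrative) of the storage claim: a ternary entry needs only about two bits under a simple encoding, while each SVD factor entry needs a full float.

```python
# Illustrative storage comparison for m = n = 700 and k = 5 (like the running example).
m, n, k = 700, 700, 5

# Rank-k truncated SVD: U_k (m*k floats), V_k (n*k floats), and k singular values.
svd_bits = (m * k + n * k + k) * 64            # 64-bit floats

# k-dimensional SDD: X_k and Y_k are ternary (~2 bits per entry is a simple encoding),
# plus k floating-point weights on the diagonal of D_k.
sdd_bits = (m * k + n * k) * 2 + k * 64

print(svd_bits, sdd_bits)   # 448320 vs 14320 bits: the SDD is far cheaper to store
```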

Interpretation
- The factor interpretation is not very useful, as the factors are not independent: a later factor can change just a subset of the values already changed by an earlier factor.
- The SDD can be interpreted as a form of bi-clustering: every layer (bump) defines a group of rows and columns with homogeneous values in the residual matrix.
- The component interpretation is natural for the SDD: the SDD is a sum of local bumps.
- The SDD doesn't model global phenomena (e.g. noise) well.

The outline of the algorithm

Input: matrix $A \in \mathbb{R}^{m \times n}$, non-negative integer $k$
Output: the $k$-dimensional SDD of $A$, i.e. matrices $X_k \in \{-1,0,1\}^{m \times k}$, $Y_k \in \{-1,0,1\}^{n \times k}$, and diagonal $D_k \in \mathbb{R}_+^{k \times k}$

1. $R_1 \leftarrow A$
2. for $i = 1, \ldots, k$:
   1. Select $y_i \in \{-1,0,1\}^n$
   2. while not converged:
      1. Compute $x_i \in \{-1,0,1\}^m$ given $y_i$ and $R_i$
      2. Compute $y_i$ given $x_i$ and $R_i$
   3. end while
   4. Set $d_i$ to the average of $R_i \circ x_i y_i^T$ over the non-zero locations of $x_i y_i^T$
   5. Set $x_i$ as the $i$-th column of $X_k$, $y_i$ as the $i$-th column of $Y_k$, and $d_i$ as the $i$-th diagonal value of $D_k$
   6. $R_{i+1} \leftarrow R_i - d_i x_i y_i^T$
3. end for
4. return $X_k$, $Y_k$, and $D_k$
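A compact, self-contained Python sketch of this loop (my own implementation, not the authors' reference code): the MAX-style initialisation of $y$ and the "x given y" subproblem described on the next slides are inlined as helpers.

```python
import numpy as np

def _best_ternary(s):
    """Given s = R y (or R^T x), return a ternary x maximizing (x^T s)^2 / ||x||_2^2."""
    order = np.argsort(-np.abs(s))          # indices of |s_i| in decreasing order
    best_J, best_val, cum = 1, -np.inf, 0.0
    for J in range(1, len(s) + 1):          # try every possible number of non-zeros
        cum += np.abs(s[order[J - 1]])
        val = cum**2 / J                    # value of the objective for the top-J support
        if val > best_val:
            best_J, best_val = J, val
    x = np.zeros_like(s, dtype=float)
    x[order[:best_J]] = np.sign(s[order[:best_J]])
    return x

def sdd(A, k, max_inner=100):
    """Heuristic k-dimensional semi-discrete decomposition A ~ X diag(d) Y^T."""
    m, n = A.shape
    X, Y, d = np.zeros((m, k)), np.zeros((n, k)), np.zeros(k)
    R = A.astype(float).copy()                            # R_1 = A
    for i in range(k):
        y = np.zeros(n)                                   # MAX initialisation:
        y[np.argmax((R**2).max(axis=0))] = 1.0            # column holding the largest squared value
        x = np.zeros(m)
        for _ in range(max_inner):                        # alternate x given y, y given x
            x_new = _best_ternary(R @ y)
            y_new = _best_ternary(R.T @ x_new)
            converged = np.array_equal(x_new, x) and np.array_equal(y_new, y)
            x, y = x_new, y_new
            if converged:
                break
        # d_i = average of R over the bump's non-zero positions (with signs)
        #     = x^T R y / (||x||^2 ||y||^2)
        d[i] = (x @ R @ y) / (np.sum(x**2) * np.sum(y**2))
        X[:, i], Y[:, i] = x, y
        R -= d[i] * np.outer(x, y)                        # residual for the next bump
    return X, d, Y

# Usage on the toy matrix from the "What is a bump?" slide:
A = np.array([[3., 1, 3], [2, 3, 1], [3, 2, 3]])
X, d, Y = sdd(A, k=2)
print(np.round(X @ np.diag(d) @ Y.T, 2))
```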

Finding the bump

Problem: given $R \in \mathbb{R}^{m \times n}$ and $y \in \{-1,0,1\}^n$, find $x \in \{-1,0,1\}^m$ such that $\|R - dxy^T\|_F^2$ is minimized.
- We set $d \leftarrow x^T R y / (\|x\|_2^2 \|y\|_2^2)$, the average of $R \circ xy^T$ over the non-zero locations of $xy^T$; we want to minimize the residual norm.
- Set $s \leftarrow Ry$. Task: find $x$ that maximizes $F(x, y) = (x^T s)^2 / \|x\|_2^2$. Maximizing $F$ equals minimizing the residual norm after $d$ is set as above.
- This could be solved optimally by trying the $2^m$ different patterns of non-zeros in $x$ and setting the signs appropriately.
- Solution: order the values so that $|s_{i_1}| \geq |s_{i_2}| \geq \cdots \geq |s_{i_m}|$ and set $x_{i_j} \leftarrow \mathrm{sign}(s_{i_j})$ for the first $J$ values and 0 elsewhere. Here $J$ is the number of non-zeros in $x$; because we don't know $J$, we try every possibility and select the best.
- The values $s_i$ contain the row sums of $R$ over the columns selected by $y$, with signs set accordingly.
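A quick check (illustrative, with a made-up vector $s$) that the sort-and-take-a-prefix rule really solves this subproblem: it compares the prefix solutions against a brute-force search over all $3^m$ ternary vectors.

```python
import itertools
import numpy as np

def F(x, s):
    """Objective (x^T s)^2 / ||x||_2^2, with F = 0 for the all-zero vector."""
    nrm = np.sum(x**2)
    return 0.0 if nrm == 0 else (x @ s)**2 / nrm

s = np.array([0.5, -2.0, 1.5, -0.1])        # s = R y for some residual R and fixed y

# Sort-based solution: order entries by |s_i|, try every prefix length J,
# and set the signs of the selected entries to sign(s_i).
order = np.argsort(-np.abs(s))
best_x, best_val = None, -np.inf
for J in range(1, len(s) + 1):
    x = np.zeros_like(s)
    x[order[:J]] = np.sign(s[order[:J]])
    if F(x, s) > best_val:
        best_x, best_val = x, F(x, s)

# Brute force over all ternary vectors x in {-1,0,1}^m.
brute_val = max(F(np.array(x, dtype=float), s)
                for x in itertools.product([-1, 0, 1], repeat=len(s)))

print(best_x, best_val, brute_val)           # the two optima agree (here: x=[0,-1,1,0], 6.125)
```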

Selecting the initial vector y

There are many ways to select the initial vector:
- MAX: set $y_j = 1$ for the column $j$ that holds the largest squared value of $R$, and the rest to zero. Intuition: the very largest squared value is probably in the best bump.
- CYC: for the $k$-th layer, set $y_j = 1$ for $j = (k \bmod n) + 1$, i.e. cycle through the columns.
- THR: select a unit vector $y$ that satisfies $\|Ry\|_2^2 \geq \|R\|_F^2 / n$, i.e. the selected column must have a squared sum above the average squared column sum. The selection can be random, or the columns can be tried one by one.
- CYC and THR can be mixed.
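Small sketches of the three strategies (my own code; THR here picks the first qualifying column rather than a random one):

```python
import numpy as np

def init_max(R):
    """MAX: unit vector on the column containing the largest squared value of R."""
    y = np.zeros(R.shape[1])
    y[np.argmax((R**2).max(axis=0))] = 1.0
    return y

def init_cyc(R, k):
    """CYC: unit vector on column (k mod n) + 1, cycling through the columns."""
    y = np.zeros(R.shape[1])
    y[k % R.shape[1]] = 1.0                     # 0-based version of (k mod n) + 1
    return y

def init_thr(R):
    """THR: a unit vector whose column has an above-average squared sum."""
    col_sq = (R**2).sum(axis=0)
    threshold = (R**2).sum() / R.shape[1]       # ||R||_F^2 / n
    j = int(np.argmax(col_sq >= threshold))     # first column meeting the threshold
    y = np.zeros(R.shape[1])
    y[j] = 1.0
    return y
```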

Example result. [Figure: the data and the 5-dimensional SDD.]

Example result. [Figure: the data and the matrix $X_5 D_5 Y_5^T - A$, i.e. the approximation error.]

Normalization

Normalization can have a profound effect on the SDD:
- Zero-centering the columns changes the type of bumps found: the bumps in the original data have the largest-magnitude values, while the bumps in the zero-centered data have the most extreme values.
- Normalizing the variance makes the matrix values more uniform and thus changes the bumps.
- Squaring the values promotes smaller bumps of exceptionally high values.
- Taking square roots of the values promotes larger bumps of smaller magnitude.
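A sketch of the four pre-processing variants discussed here (numpy, my own code and mode names); each produces a different matrix to feed into the sdd() routine sketched above:

```python
import numpy as np

def normalize(A, mode):
    """Return a transformed copy of A for the SDD; the mode names are mine, not the slides'."""
    A = A.astype(float)
    if mode == "zero_center":            # subtract each column's mean
        return A - A.mean(axis=0)
    if mode == "unit_variance":          # zero-center and divide by each column's std
        return (A - A.mean(axis=0)) / A.std(axis=0)
    if mode == "square":                 # emphasise exceptionally large values
        return A**2
    if mode == "sqrt":                   # de-emphasise large values (assumes A >= 0)
        return np.sqrt(A)
    raise ValueError(mode)

# e.g. sdd(normalize(A, "zero_center"), k=5) finds bumps of extreme rather than merely large values
```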

Normalization example: zero-centered data. [Figures: the zero-centered data and the first bump.]

Normalization example: zero-centered data. [Figures: the zero-centered data and the second bump.]

Normalization example: zero-centered data. [Figures: the zero-centered data and the third bump. Note that here red means 0.]

Normalization example: zero-centered data. [Figures: the zero-centered data and the 5-dimensional SDD.]

Normalization example: square root of the data. [Figures: the data after taking element-wise square roots and the first bump.]

Normalization example: square root of the data. [Figures: the data after taking element-wise square roots and the second bump.]

Normalization example: square root of the data. [Figures: the data after taking element-wise square roots and the third bump.]

Normalization example: square root of the data. [Figures: the data after taking element-wise square roots and the 5-dimensional SDD.]

Normalization example: squared data. [Figures: the squared data and the first bump.]

Normalization example: squared data. [Figures: the squared data and the second bump.]

Normalization example: squared data. [Figures: the squared data and the third bump.]

Normalization example: squared data. [Figures: the squared data and the 5-dimensional SDD.]

Clustering

SDD performs a type of bi-clustering of the matrix:
- Every bump $dxy^T$ gives a cluster of rows ($x_i \neq 0$) and a cluster of columns ($y_j \neq 0$), together with a centroid value $d$.
- This is not a partition clustering: the clusters can overlap, and not every row or column has to belong to some cluster.
- We can impose an ordering of the bumps based on the values of $d_i$; the algorithm usually returns the bumps in that order.

This ordering can be used to obtain a hierarchical clustering:
- The first column of $X$ splits the rows of $A$ into three sets (by the values $-1$, $0$, $1$), and likewise the first column of $Y$ splits the columns of $A$.
- The second column of $X$ splits the previous clusters again into three sets each (some of these sets can be empty), and so on and so forth.
- The distance between two objects in this hierarchical clustering is not the usual dendrogram depth, but depends on whether we use the $0$ or $\pm 1$ branches.
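A sketch (my own code, with a hypothetical factorization) of how the bi-clusters and the hierarchical labels can be read off factors of the form returned by the sdd() sketch above:

```python
import numpy as np

# X, d, Y as returned by the sdd() sketch above (hypothetical small factors here).
X = np.array([[ 1, 0], [ 1, 1], [ 0, 1], [-1, 1]])
Y = np.array([[ 1, 1], [ 1, 0], [ 0, -1]])
d = np.array([2.0, 0.5])

# Bi-cluster view: layer i groups the rows with X[:, i] != 0 and the columns
# with Y[:, i] != 0, with "centroid" d[i].
for i in range(len(d)):
    rows = np.flatnonzero(X[:, i])
    cols = np.flatnonzero(Y[:, i])
    print(f"bump {i}: rows {rows}, cols {cols}, centroid {d[i]}")

# Hierarchical view: the sequence of {-1, 0, +1} values along a row of X is a path
# in a ternary tree; rows sharing a longer prefix sit in the same subtree.
row_labels = ["".join(str(int(v)) for v in row) for row in X]
print(row_labels)   # e.g. ['10', '11', '01', '-11']
```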

Other applications
- Image compression (O'Leary & Peleg, 1983): a grayscale image is compressed using its SDD. This was the original application; modern image compression techniques are better.
- Latent topic models (Kolda & O'Leary, 1998): used similarly to how the SVD is used to compute LSA; compute the SDD of the term-document matrix.
- SDD of correlation matrices: a bump in the correlation matrix $AA^T$ corresponds to rows of $A$ with similar values; a bump in $A^TA$ corresponds to columns with similar values.

O'Leary & Peleg, 1983; Kolda & O'Leary, 1998

General approaches
- The most common way to combine SVD and SDD is to first use the SVD to denoise the data and then compute the SDD on the cleaned data (cleaned data = the truncated SVD $A_k$). The SVD is good at finding global structure, the SDD at finding local structure.
- Another option is to first compute $A_k$ with the SVD and then apply the SDD to $A_k A_k^T$ (or $A_k^T A_k$); the SDD then finds the objects with similar values.
- The results can be visualized using the first 2-3 columns of $U$ from the SVD and the first layers of the hierarchical clustering from the SDD.
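A sketch of these combinations (my own code, reusing the sdd() sketch from the algorithm section): truncate the SVD to rank $k$ for denoising, then run the SDD on the cleaned matrix, or on $A_k A_k^T$ to group similar rows.

```python
import numpy as np

def svd_then_sdd(A, k_svd, k_sdd):
    """Denoise A with a rank-k_svd truncated SVD, then compute a k_sdd-dimensional SDD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_clean = U[:, :k_svd] * s[:k_svd] @ Vt[:k_svd, :]   # truncated SVD A_k
    return sdd(A_clean, k_sdd)                           # sdd() as sketched earlier

def svd_then_sdd_rows(A, k_svd, k_sdd):
    """Variant: run the SDD on A_k A_k^T to find groups of rows with similar values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k_svd] * s[:k_svd] @ Vt[:k_svd, :]
    return sdd(A_k @ A_k.T, k_sdd)
```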

Classifying galaxies. [Figure 7: plot of an SVD of galaxy data (axes U1, U2, U3). Figure 8: plot of the SVD of galaxy data, overlaid with the SDD classification.] Skillicorn, Color Figures 7 and 8

Finding minerals. [Figure 11: plot with positions from the SVD (axes U1, U2, U3) and colour and shape labelling from the SDD.] Clustering information from the SDD added: the first bump defines the colour, the second the marker. Colour corresponds to the depth of the sample. Skillicorn, Figure 6.7 and Color Figure 11

Lessons learned
- The SDD finds local areas of values with uniform magnitude: easier interpretation, and an orthogonal view to the SVD.
- Finding the SDD is hard and requires a heuristic.
- Together, SVD and SDD provide a strong analysis toolset.

Suggested reading
- Skillicorn, Chapters 5 & 6
- Tamara G. Kolda & Dianne P. O'Leary, 1998. A Semidiscrete Matrix Decomposition for Latent Semantic Indexing in Information Retrieval. ACM Trans. Inf. Syst. 16(4), pp. 322-346. DOI: 10.1145/291128.291131
- Tamara G. Kolda & Dianne P. O'Leary, 2000. Algorithm 805: Computation and Uses of the Semidiscrete Matrix Decomposition. ACM Trans. Math. Software 26(3), pp. 415-435. DOI: 10.1145/358407.358424