Approximate Spectral Clustering via Randomized Sketching

Approximate Spectral Clustering via Randomized Sketching. Christos Boutsidis, Yahoo! Labs, New York. Joint work with Alex Gittens (eBay) and Anju Kambadur (IBM).

The big picture: sketch and solve. Tradeoff: speed (which depends on the size of the sketch Ã) against accuracy (quantified by the parameter ε > 0).

Sketching techniques (high level)
1. Sampling: A → Ã by picking a subset of the columns of A.
2. Linear sketching: A → Ã = AR for some matrix R.
3. Non-linear sketching: A → Ã (no linear relationship between A and Ã).
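The following is a minimal NumPy sketch of the first two families; the matrix sizes and the sketch size c are illustrative choices, not the parameters analyzed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 200))    # data matrix (illustrative size)
c = 50                                  # sketch size (illustrative)

# 1. Sampling: A -> A~ by keeping a random subset of c columns of A.
idx = rng.choice(A.shape[1], size=c, replace=False)
A_sampled = A[:, idx]                   # 1000 x 50

# 2. Linear sketching: A -> A~ = A R with a random Gaussian matrix R.
R = rng.standard_normal((A.shape[1], c)) / np.sqrt(c)
A_sketched = A @ R                      # 1000 x 50
```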

Sketching techniques (low level)
1. Sampling:
   - Importance sampling: randomized sampling with probabilities proportional to the norms of the columns of A [Frieze, Kannan, Vempala, FOCS 1998], [Drineas, Kannan, Mahoney, SISC 2006].
   - Subspace sampling: randomized sampling with probabilities proportional to the norms of the rows of the matrix V_k containing the top k right singular vectors of A (leverage-scores sampling) [Drineas, Mahoney, Muthukrishnan, SODA 2006].
   - Deterministic sampling: deterministically selecting rows from V_k, equivalently columns from A [Batson, Spielman, Srivastava, STOC 2009], [Boutsidis, Drineas, Magdon-Ismail, FOCS 2011].
2. Linear sketching:
   - Random projections: post-multiply A with a random Gaussian matrix [Johnson, Lindenstrauss 1982].
   - Fast random projections: post-multiply A with an FFT-type random matrix [Ailon, Chazelle 2006].
   - Sparse random projections: post-multiply A with a sparse matrix [Clarkson, Woodruff, STOC 2013].
3. Non-linear sketching:
   - Frequent Directions: an SVD-type transform [Liberty, KDD 2013], [Ghashami, Phillips, SODA 2014].
   - Other non-linear dimensionality reduction methods such as LLE, ISOMAP, etc.
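As an illustration of subspace (leverage-score) sampling, here is a small NumPy sketch; the target rank k, the sample size c, and the rescaling convention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 300))
k, c = 10, 60                           # target rank and sample size (illustrative)

# Leverage scores: squared row norms of V_k (top-k right singular vectors of A).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
lev = np.sum(Vt[:k, :] ** 2, axis=0)    # one score per column of A
probs = lev / lev.sum()

# Sample c columns with probability proportional to their leverage scores,
# rescaling each kept column by 1/sqrt(c * p_j) as in importance sampling.
idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
A_sketch = A[:, idx] / np.sqrt(c * probs[idx])
```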

Problems
Linear algebra:
1. Matrix multiplication [Drineas, Kannan, Rudelson, Vershynin, Woodruff, Ipsen, Liberty, and others]
2. Low-rank matrix approximation [Tygert, Tropp, Clarkson, Candes, B., Deshpande, Vempala, and others]
3. Element-wise sparsification [Achlioptas, McSherry, Kale, Drineas, Zouzias, Liberty, Karnin, and others]
4. Least-squares [Mahoney, Muthukrishnan, Dasgupta, Kumar, Sarlos, Rokhlin, Boutsidis, Avron, and others]
5. Linear equations with SDD matrices [Spielman, Teng, Koutis, Miller, Peng, Orecchia, Kelner, and others]
6. Determinant of SPSD matrices [Barry, Pace, B., Zouzias, and others]
7. Trace of SPSD matrices [Avron, Toledo, Bekas, Roosta-Khorasani, Uri Ascher, and others]
Machine learning:
1. Canonical correlation analysis [Avron, B., Toledo, Zouzias]
2. Kernel learning [Rahimi, Recht, Smola, Sindhwani, and others]
3. k-means clustering [B., Zouzias, Drineas, Magdon-Ismail, Mahoney, Feldman, and others]
4. Spectral clustering [Gittens, Kambadur, Boutsidis, Strohmer, and others]
5. Spectral graph sparsification [Batson, Spielman, Srivastava, Koutis, Miller, Peng, Kelner, and others]
6. Support vector machines [Paul, B., Drineas, Magdon-Ismail, and others]
7. Regularized least-squares classification [Dasgupta, Drineas, Harb, Josifovski, Mahoney]

What approach should we use to cluster these data? (Figure: 2-dimensional points belonging to 3 different clusters.) Answer: k-means clustering.

k-means optimizes the right metric over this space. (Figure: 2-dimensional points belonging to 3 different clusters.) Let $P = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$ and let k be the number of clusters. A k-partition of P is a collection $S = \{S_1, S_2, \ldots, S_k\}$ of sets of points. For each set $S_j$, let $\mu_j \in \mathbb{R}^d$ be its centroid. The k-means objective function is
$$F(P, S) = \sum_{i=1}^{n} \|x_i - \mu(x_i)\|_2^2,$$
where $\mu(x_i)$ is the centroid of the cluster containing $x_i$. Find the best partition: $S_{\mathrm{opt}} = \operatorname*{argmin}_{S} F(P, S)$.
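A minimal NumPy sketch of the objective F(P, S) for a given partition; the toy data and cluster labels below are illustrative.

```python
import numpy as np

def kmeans_objective(X, labels):
    """F(P, S) = sum_i ||x_i - mu(x_i)||_2^2 for a given partition (labels)."""
    total = 0.0
    for j in np.unique(labels):
        cluster = X[labels == j]
        mu_j = cluster.mean(axis=0)            # centroid of cluster S_j
        total += np.sum((cluster - mu_j) ** 2)
    return total

# Toy usage: three well-separated 2-d clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in [(0, 0), (3, 0), (0, 3)]])
labels = np.repeat([0, 1, 2], 50)
print(kmeans_objective(X, labels))
```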

What approach should we use to cluster these data? Answer: k-means will fail miserably. What else?

Spectral Clustering: transform the data into a space where k-means would be useful. (Figure: 1-d representation of the points from the first dataset in the previous picture; this is an eigenvector of an appropriate graph.)

Spectral Clustering: the graph-theoretic perspective.
Definition. Given n points $x_1, x_2, \ldots, x_n$ in d-dimensional space, let G(V, E) be the corresponding graph with n nodes and similarity matrix $W \in \mathbb{R}^{n \times n}$ with $W_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma}$ for $i \neq j$ and $W_{ii} = 0$. Let k be the number of clusters. For $k = 2$: find subgraphs of G, denoted A and B, that minimize
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)},$$
where $\mathrm{cut}(A, B) = \sum_{x_i \in A,\, x_j \in B} W_{ij}$, $\mathrm{assoc}(A, V) = \sum_{x_i \in A,\, x_j \in V} W_{ij}$, and $\mathrm{assoc}(B, V) = \sum_{x_i \in B,\, x_j \in V} W_{ij}$.
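A small NumPy sketch that evaluates Ncut(A, B) directly from the definition; the toy similarity matrix and the choice σ = 1 are illustrative.

```python
import numpy as np

def ncut(W, in_A):
    """Normalized cut of the 2-way partition (A, B) of a graph with similarity matrix W.
    in_A is a boolean mask selecting the nodes of A; B is the complement."""
    in_B = ~in_A
    cut_AB = W[np.ix_(in_A, in_B)].sum()
    assoc_AV = W[in_A, :].sum()
    assoc_BV = W[in_B, :].sum()
    return cut_AB / assoc_AV + cut_AB / assoc_BV

# Toy usage: Gaussian similarity on a few 2-d points, split into two halves.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
sigma = 1.0
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-D2 / sigma)
np.fill_diagonal(W, 0.0)
print(ncut(W, np.arange(10) < 5))
```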

Spectral Clustering: the linear-algebraic perspective. For any G, A, B and partition vector $y \in \mathbb{R}^n$ with $+1$ in the entries corresponding to A and $-1$ in the entries corresponding to B, it holds that
$$4\,\mathrm{Ncut}(A, B) = \frac{y^T (D - W) y}{y^T D y}.$$
Here, $D \in \mathbb{R}^{n \times n}$ is the diagonal matrix of node degrees: $D_{ii} = \sum_j W_{ij}$.
Definition. Given a graph G with n nodes, adjacency matrix W, and degree matrix D, find
$$y^{\ast} = \operatorname*{argmin}_{y \in \mathbb{R}^n,\; y^T D 1_n = 0} \frac{y^T (D - W) y}{y^T D y}.$$
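A minimal sketch of the relaxed problem, assuming the standard reduction: substituting $z = D^{1/2} y$ turns the objective into the Rayleigh quotient of the normalized Laplacian $D^{-1/2}(D - W)D^{-1/2}$, so its second-smallest eigenvector, mapped back by $D^{-1/2}$, solves the relaxation.

```python
import numpy as np

def fiedler_vector(W):
    """Relaxed Ncut minimizer: second-smallest eigenvector z of
    D^{-1/2} (D - W) D^{-1/2}, mapped back to y = D^{-1/2} z."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - (W * np.outer(d_inv_sqrt, d_inv_sqrt))
    vals, vecs = np.linalg.eigh(L_sym)     # eigenvalues in ascending order
    return d_inv_sqrt * vecs[:, 1]         # skip the trivial (smallest) eigenvector
```

With the similarity matrix W from the previous sketch, the sign pattern of fiedler_vector(W) separates the two groups of points.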

Spectral Clustering: algorithm for k-partitioning. Cluster n points $x_1, x_2, \ldots, x_n$ into k clusters:
1. Construct the similarity matrix $W \in \mathbb{R}^{n \times n}$ with $W_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma}$ (for $i \neq j$) and $W_{ii} = 0$.
2. Construct $D \in \mathbb{R}^{n \times n}$ as the diagonal matrix of node degrees: $D_{ii} = \sum_j W_{ij}$.
3. Construct $\widetilde{W} = D^{-1/2} W D^{-1/2} \in \mathbb{R}^{n \times n}$.
4. Find the top k eigenvectors of $\widetilde{W}$ and assign them as columns of a matrix $Y \in \mathbb{R}^{n \times k}$.
5. Apply k-means clustering to the rows of Y, and cluster the original points accordingly.
In a nutshell: compute the top k eigenvectors of $\widetilde{W}$ and then apply k-means to the rows of the matrix containing those eigenvectors.
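A compact Python sketch of the exact algorithm above; it assumes scikit-learn's KMeans is available for step 5, and the bandwidth σ is an illustrative parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0, seed=0):
    """Exact spectral clustering following the five steps above."""
    # Steps 1-2: similarity matrix and node degrees.
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / sigma)
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Step 3: normalized similarity matrix W~ = D^{-1/2} W D^{-1/2}.
    W_tilde = W * np.outer(d_inv_sqrt, d_inv_sqrt)
    # Step 4: top-k eigenvectors of W~ (eigh returns ascending eigenvalues).
    _, vecs = np.linalg.eigh(W_tilde)
    Y = vecs[:, -k:]
    # Step 5: k-means on the rows of Y.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y)
```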

Spectral Clustering via Randomized Sketching. Cluster n points $x_1, x_2, \ldots, x_n$ into k clusters:
1. Construct the similarity matrix $W \in \mathbb{R}^{n \times n}$ with $W_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma}$ (for $i \neq j$) and $W_{ii} = 0$.
2. Construct $D \in \mathbb{R}^{n \times n}$ as the diagonal matrix of node degrees: $D_{ii} = \sum_j W_{ij}$.
3. Construct $\widetilde{W} = D^{-1/2} W D^{-1/2} \in \mathbb{R}^{n \times n}$.
4. Let $\widetilde{Y} \in \mathbb{R}^{n \times k}$ contain the left singular vectors of $B = (\widetilde{W}\widetilde{W}^T)^p \widetilde{W} S$, where $p \ge 0$ is an integer and $S \in \mathbb{R}^{n \times k}$ is a matrix of i.i.d. random Gaussian variables.
5. Apply k-means clustering to the rows of $\widetilde{Y}$, and cluster the original data points accordingly.
In a nutshell: approximate the top k eigenvectors of $\widetilde{W}$ and then apply k-means to the rows of the matrix containing those approximate eigenvectors.
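A compact Python sketch of the randomized variant; the number of power iterations p, the bandwidth σ, and the use of scikit-learn's KMeans are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def approx_spectral_clustering(X, k, p=2, sigma=1.0, seed=0):
    """Randomized variant: replace the exact eigenvectors of W~ by the left
    singular vectors of B = (W~ W~^T)^p W~ S with a Gaussian sketch S (n x k)."""
    rng = np.random.default_rng(seed)
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / sigma)
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    W_tilde = W * np.outer(d_inv_sqrt, d_inv_sqrt)
    # Gaussian sketch followed by p steps of power iteration (W~ is symmetric).
    B = W_tilde @ rng.standard_normal((W.shape[0], k))
    for _ in range(p):
        B = W_tilde @ (W_tilde @ B)
    Y_tilde, _, _ = np.linalg.svd(B, full_matrices=False)   # orthonormal n x k basis
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y_tilde)
```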

Related work
- The Nyström method: uniform random sampling of the similarity matrix W, then compute the eigenvectors [Fowlkes et al. 2004].
- The Spielman-Teng iterative algorithm: a very strong theoretical result based on their fast solvers for SDD systems of linear equations; a complex algorithm to implement [2009].
- Spectral clustering via random projections: reduce the dimension of the data points before forming the similarity matrix W; no theoretical results are reported for this method [Sakai and Imiya, 2009].
- Power iteration clustering: similar to our idea but for the k = 2 case; no theoretical results reported [Lin, Cohen, ICML 2010].
- Other approximation algorithms: [Yen et al. KDD 2009]; [Shamir and Tishby, AISTATS 2011]; [Wang et al. KDD 2009].

Approximation Framework for Spectral Clustering. Assume that $\|Y - \widetilde{Y}\|_2 \le \varepsilon$. For all $i = 1, \ldots, n$, let $y_i^T, \tilde{y}_i^T \in \mathbb{R}^{1 \times k}$ be the i-th rows of Y and $\widetilde{Y}$. Then
$$\|y_i - \tilde{y}_i\|_2 \le \|Y - \widetilde{Y}\|_2 \le \varepsilon.$$
Clustering the rows of Y and the rows of $\widetilde{Y}$ with the same method should result in the same clustering: a distance-based algorithm such as k-means would lead to the same clustering as $\varepsilon \to 0$. This is equivalent to saying that k-means is robust to small perturbations of the input.

Approximation Framework for Spectral Clustering. The rows of $\widetilde{Y}$ and $\widetilde{Y}Q$, where Q is any square orthonormal matrix, are clustered identically.
Definition (Closeness of Approximation). Y and $\widetilde{Y}$ are close for clustering purposes if there exists a square orthonormal matrix Q such that $\|Y - \widetilde{Y}Q\|_2 \le \varepsilon$.

This is really a problem of bounding subspaces.
Lemma. There is an orthonormal matrix $Q \in \mathbb{R}^{k \times k}$ ($Q^T Q = I_k$) such that
$$\|Y - \widetilde{Y}Q\|_2^2 \le 2k\,\|YY^T - \widetilde{Y}\widetilde{Y}^T\|_2^2.$$
$\|YY^T - \widetilde{Y}\widetilde{Y}^T\|_2$ corresponds to the sine of the largest principal angle between $\mathrm{span}(Y)$ and $\mathrm{span}(\widetilde{Y})$. Q is the solution of the following Procrustes problem:
$$\min_{Q} \|Y - \widetilde{Y}Q\|_F.$$
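A small NumPy sketch of the Procrustes step: over orthogonal Q, the minimizer of $\|Y - \widetilde{Y}Q\|_F$ is $Q = U V^T$ where $\widetilde{Y}^T Y = U \Sigma V^T$; the toy perturbation below is illustrative.

```python
import numpy as np

def procrustes_align(Y, Y_tilde):
    """Orthogonal Procrustes: the k x k orthonormal Q minimizing ||Y - Y_tilde Q||_F
    is Q = U V^T, where Y_tilde^T Y = U Sigma V^T."""
    U, _, Vt = np.linalg.svd(Y_tilde.T @ Y)
    return U @ Vt

# Toy check: Y_tilde spans (almost) the same subspace as Y, up to a rotation.
rng = np.random.default_rng(4)
Y, _ = np.linalg.qr(rng.standard_normal((100, 5)))
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))          # a random k x k orthogonal matrix
Y_tilde = Y @ R + 1e-3 * rng.standard_normal((100, 5))    # small perturbation
Q = procrustes_align(Y, Y_tilde)
print(np.linalg.norm(Y - Y_tilde @ Q, 2))                 # small spectral-norm error
```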

The Singular Value Decomposition (SVD). Let A be an $m \times n$ matrix with $\mathrm{rank}(A) = \rho$ and $k \le \rho$. Then
$$A = U_A \Sigma_A V_A^T = \underbrace{\begin{pmatrix} U_k & U_{\rho-k} \end{pmatrix}}_{m \times \rho} \underbrace{\begin{pmatrix} \Sigma_k & 0 \\ 0 & \Sigma_{\rho-k} \end{pmatrix}}_{\rho \times \rho} \underbrace{\begin{pmatrix} V_k^T \\ V_{\rho-k}^T \end{pmatrix}}_{\rho \times n}.$$
$U_k$: $m \times k$ matrix of the top-k left singular vectors of A. $V_k$: $n \times k$ matrix of the top-k right singular vectors of A. $\Sigma_k$: $k \times k$ diagonal matrix of the top-k singular values of A.
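A short NumPy illustration of extracting $U_k$, $\Sigma_k$, $V_k$ from the thin SVD; the matrix sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((80, 60))
k = 10

# Thin SVD; singular values are returned in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, Sigma_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# A_k = U_k Sigma_k V_k^T is the best rank-k approximation of A.
A_k = U_k @ Sigma_k @ V_k.T
```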

A structural result.
Theorem. Given $A \in \mathbb{R}^{m \times n}$, let $S \in \mathbb{R}^{n \times k}$ be such that
$$\mathrm{rank}(A_k S) = k \quad \text{and} \quad \mathrm{rank}(V_k^T S) = k.$$
Let $p \ge 0$ be an integer and let
$$\gamma_p = \big\|\Sigma_{\rho-k}^{2p+1}\, V_{\rho-k}^T S\, (V_k^T S)^{-1}\, \Sigma_k^{-(2p+1)}\big\|_2.$$
Then, for $\Omega_1 = (AA^T)^p A S$ and $\Omega_2 = A_k$, we obtain
$$\big\|\Omega_1 \Omega_1^{+} - \Omega_2 \Omega_2^{+}\big\|_2^2 = \frac{\gamma_p^2}{1 + \gamma_p^2}.$$
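Below is a hypothetical numerical sanity check of the stated identity on a random instance; it is not part of the proof, and the instance sizes, the Gaussian choice of S, and the helper proj are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k, p = 40, 30, 5, 1
A = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
rho = np.linalg.matrix_rank(A)                 # here rho = min(m, n) almost surely

S = rng.standard_normal((n, k))                # Gaussian sketching matrix

# gamma_p = || Sigma_{rho-k}^{2p+1} V_{rho-k}^T S (V_k^T S)^{-1} Sigma_k^{-(2p+1)} ||_2
Vk, Vr = Vt[:k, :].T, Vt[k:rho, :].T
Sk, Sr = np.diag(s[:k]), np.diag(s[k:rho])
gamma = np.linalg.norm(
    np.linalg.matrix_power(Sr, 2 * p + 1) @ Vr.T @ S
    @ np.linalg.inv(Vk.T @ S)
    @ np.linalg.matrix_power(np.linalg.inv(Sk), 2 * p + 1), 2)

def proj(M):
    """Orthogonal projector M M^+ onto the column space of M."""
    return M @ np.linalg.pinv(M)

Omega1 = np.linalg.matrix_power(A @ A.T, p) @ A @ S
Omega2 = U[:, :k] @ Sk @ Vk.T                  # A_k
lhs = np.linalg.norm(proj(Omega1) - proj(Omega2), 2) ** 2
print(lhs, gamma**2 / (1 + gamma**2))          # the two numbers should agree
```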

Some derivations lead to the final result:
$$\begin{aligned}
\gamma_p &\le \big\|\Sigma_{\rho-k}^{2p+1}\big\|_2 \,\big\|V_{\rho-k}^T S\big\|_2 \,\big\|(V_k^T S)^{-1}\big\|_2 \,\big\|\Sigma_k^{-(2p+1)}\big\|_2 \\
&= \left(\frac{\sigma_{k+1}}{\sigma_k}\right)^{2p+1} \frac{\sigma_{\max}(V_{\rho-k}^T S)}{\sigma_{\min}(V_k^T S)} \\
&\le \left(\frac{\sigma_{k+1}}{\sigma_k}\right)^{2p+1} \frac{\sigma_{\max}(V^T S)}{\sigma_{\min}(V_k^T S)} \\
&\le \left(\frac{\sigma_{k+1}}{\sigma_k}\right)^{2p+1} \frac{4\sqrt{n-k}}{\delta/\sqrt{k}}
= 4\,\delta^{-1}\sqrt{k(n-k)}\left(\frac{\sigma_{k+1}}{\sigma_k}\right)^{2p+1}.
\end{aligned}$$
The last step bounds the two singular values using the random-matrix lemmas on the next slide, each of which holds with high probability.

Random Matrix Theory.
Lemma (norm of a random Gaussian matrix). Let $A \in \mathbb{R}^{n \times m}$ be a matrix of i.i.d. standard Gaussian random variables, where $n \ge m$. Then, for every $t \ge 4$,
$$\Pr\{\sigma_1(A) \ge t\,n^{1/2}\} \le e^{-nt^2/8}.$$
Lemma (invertibility of a random Gaussian matrix). Let $A \in \mathbb{R}^{n \times n}$ be a matrix of i.i.d. standard Gaussian random variables. Then, for any $\delta > 0$,
$$\Pr\{\sigma_n(A) \le \delta\,n^{-1/2}\} \le 2.35\,\delta.$$
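A quick Monte Carlo illustration of the two lemmas; the sizes and number of trials are illustrative, and this is a sanity check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, trials = 200, 50, 200

# Largest singular value of an n x m Gaussian matrix is rarely above 4*sqrt(n).
max_sv = [np.linalg.norm(rng.standard_normal((n, m)), 2) for _ in range(trials)]
print(np.mean(np.array(max_sv) >= 4 * np.sqrt(n)))        # observed frequency, ~0

# Smallest singular value of an n x n Gaussian matrix falls below delta/sqrt(n)
# with probability at most 2.35*delta.
delta = 0.1
min_sv = [np.linalg.svd(rng.standard_normal((n, n)), compute_uv=False)[-1]
          for _ in range(trials)]
print(np.mean(np.array(min_sv) <= delta / np.sqrt(n)))    # compare against 2.35*delta
```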

Main Theorem.
Theorem. If for some $\varepsilon > 0$ and $\delta > 0$ we choose
$$p \;\ge\; \frac{1}{2}\,\frac{\ln\!\big(4\,\varepsilon^{-1}\delta^{-1}\sqrt{k(n-k)}\big)}{\ln\!\big(\sigma_k(\widetilde{W})/\sigma_{k+1}(\widetilde{W})\big)},$$
then with probability at least $1 - e^{-n} - 2.35\,\delta$,
$$\|Y - \widetilde{Y}Q\|_2^2 \;\le\; \frac{\varepsilon^2}{1+\varepsilon^2} \;=\; O(\varepsilon^2).$$
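As a small illustration, the following hypothetical helper evaluates the theorem's lower bound on p for some made-up values of ε, δ, k, n, and the spectral gap $\sigma_k(\widetilde{W})/\sigma_{k+1}(\widetilde{W})$.

```python
import numpy as np

def required_power(eps, delta, k, n, sigma_k, sigma_k1):
    """Smallest integer p satisfying
    p >= 0.5 * ln(4 * eps^-1 * delta^-1 * sqrt(k*(n-k))) / ln(sigma_k / sigma_{k+1})."""
    return int(np.ceil(
        0.5 * np.log(4 / (eps * delta) * np.sqrt(k * (n - k)))
        / np.log(sigma_k / sigma_k1)))

# Illustrative numbers: a modest spectral gap sigma_k / sigma_{k+1} = 1.5.
print(required_power(eps=0.1, delta=0.1, k=10, n=10_000, sigma_k=0.9, sigma_k1=0.6))
```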