A Tutorial on Matrix Approximation by Row Sampling
Rasmus Kyng
June 11, 2018

Contents

1 Fast Linear Algebra Talk
  1.1 Matrix Concentration
  1.2 Algorithms for ɛ-approximation of a matrix
    1.2.1 Uniform Sampling for Leverage Score Estimation
  1.3 JL Speed-up
  1.4 BSS-sparsification

1 Fast Linear Algebra Talk

1.1 Matrix Concentration

Consider $A \in \mathbb{R}^{m \times n}$. For simplicity, assume $m \geq n$ and $\operatorname{rank}(A) = n$, and write
$$A = \begin{pmatrix} a_1^\top \\ \vdots \\ a_m^\top \end{pmatrix}.$$

Definition 1.1 (Spectral ɛ-approximation). We will say that $B \in \mathbb{R}^{m' \times n}$ is an ɛ-approximation of $A$ iff for all $x \in \mathbb{R}^n$
$$(1-\epsilon)\,\|Ax\|_2^2 \;\leq\; \|Bx\|_2^2 \;\leq\; (1+\epsilon)\,\|Ax\|_2^2.$$

Definition 1.2 (Leverage Score). The leverage score of the $i$th row of $A$ is
$$\tau_i(A) = \max_{x \in \mathbb{R}^n} \frac{(a_i^\top x)^2}{\|Ax\|_2^2}.$$

Theorem 1.3 (Leverage Score Sum). $\sum_{i \in [m]} \tau_i(A) = \operatorname{rank}(A)$.

Below is a useful version of a Matrix Chernoff concentration result, this one due to Tropp [Tro12]. Important earlier versions were developed by Rudelson [Rud99] and Ahlswede and Winter [AW02].

Theorem 1.4 (Matrix Chernoff). Form $S \subseteq [m]$ by including each $i$ independently with probability $p_i \geq \min\!\left(\epsilon^{-2}\,\tau_i(A)\log(n/\delta),\,1\right)$. Define the diagonal matrix $D \in \mathbb{R}^{m \times m}$ by
$$D(i,i) = \begin{cases} 1/\sqrt{p_i} & \text{if } i \in S \\ 0 & \text{otherwise.} \end{cases}$$
Then with probability at least $1-\delta$, $DA$ is an ɛ-approximation of $A$. Note that the expected number of nonzero rows in $DA$ is at most $\epsilon^{-2}\, n \log(n/\delta)$.

1.2 Algorithms for ɛ-approximation of a matrix

Suppose we are given a matrix $A \in \mathbb{R}^{m \times n}$ with $m \gg n$, and we would like to find $O(\epsilon^{-2} n \log(n/\delta))$ rescaled rows of $A$ forming $\tilde{A}$, s.t. $\tilde{A}$ is an ɛ-approximation of $A$ (whp). Notice
$$\tau_i(A) = \max_{x \in \mathbb{R}^n} \frac{(a_i^\top x)^2}{\|Ax\|_2^2} = a_i^\top (A^\top A)^{-1} a_i.$$
So, we can compute leverage scores if we can compute inverses. But that's expensive! To compute the leverage scores naively would require us to

1. Compute $A^\top A$. Time $O(n^{\omega-2} m^2)$.
2. Compute $(A^\top A)^{-1}$. Time $\tilde{O}(n^\omega)$.
3. Compute $C = (A^\top A)^{-1} A^\top$. Time $O(n^{\omega-1} m)$.
4. Compute $a_i^\top c_i$ for all $i$, where $c_i$ is the $i$th column of $C$. Time $O(nm)$.

Overall, we get $O(n^{\omega-2} m^2)$ time to compute all the leverage scores, and given the leverage scores, we can compute $\tilde{A} = DA$ in time $O(\operatorname{nnz}(A)) \leq O(mn)$, which is a lower order term.
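As a concrete companion to Theorem 1.4 and the naive procedure above, here is a minimal NumPy sketch: it computes exact leverage scores via $(A^\top A)^{-1}$ and then performs the independent row sampling with rescaling by $1/\sqrt{p_i}$. The function names, the Gaussian test matrix, and the absence of an explicit constant in $p_i$ are illustrative choices, not part of the notes.

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores tau_i = a_i^T (A^T A)^{-1} a_i (assumes rank(A) = n)."""
    C = np.linalg.inv(A.T @ A) @ A.T           # C = (A^T A)^{-1} A^T, the expensive step
    return np.einsum('ij,ji->i', A, C)         # tau_i = a_i^T c_i

def sample_rows(A, eps, delta, rng):
    """Row sampling as in Theorem 1.4: keep row i w.p. p_i and rescale it by 1/sqrt(p_i)."""
    m, n = A.shape
    p = np.minimum(eps**-2 * leverage_scores(A) * np.log(n / delta), 1.0)
    keep = rng.random(m) < p
    return A[keep] / np.sqrt(p[keep])[:, None]  # the nonzero rows of DA

# Small check: the generalized eigenvalues of (A~^T A~, A^T A) should (whp) lie in [1-eps, 1+eps].
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 20))
A_tilde = sample_rows(A, eps=0.5, delta=0.1, rng=rng)
gen_evals = np.real(np.linalg.eigvals(np.linalg.solve(A.T @ A, A_tilde.T @ A_tilde)))
print(A_tilde.shape[0], gen_evals.min(), gen_evals.max())
```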

1.2.1 Uniform Sampling for Leverage Score Estimation

Still, our goal is to compute an approximation $\tilde{A}$ of $A$ with fewer rows. It turns out that in the above algorithm sketch, we can replace the use of $A^\top A$ in Step 1 and onwards with a very crude approximation of the matrix and still get approximate leverage scores that are good enough for sampling. The following definition is helpful:

Definition 1.5 (Generalized Leverage Scores). The generalized leverage score of row $i$ of $A$ w.r.t. $B$ is
$$\tau_i^B(A) = \max_{x \in \mathbb{R}^n} \frac{(a_i^\top x)^2}{\|Bx\|_2^2}.$$

The following theorem is the basis for a simple and clever algorithm for leverage score estimation. It is due to Cohen et al. [CLM+15].

Theorem 1.6 (Uniform Sampling for Leverage Score Approximation). Let $T \subseteq [m]$ denote a uniform random sample of $d$ rows of $A$. Define the matrix $\mathbf{T} \in \mathbb{R}^{m \times m}$ to be the diagonal indicator for $T$, i.e. $\mathbf{T}(i,i) = 1$ if $i \in T$ and all other entries are zero. Define
$$\tilde{\tau}_i \stackrel{\mathrm{def}}{=} \begin{cases} \tau_i^{\mathbf{T}A}(A) & \text{if } i \in T, \\[4pt] \dfrac{1}{1 + 1/\tau_i^{\mathbf{T}A}(A)} & \text{otherwise.} \end{cases}$$
Then $\tilde{\tau}_i \geq \tau_i(A)$ for all $i$, and
$$\mathbf{E}\left[\sum_{i=1}^{m} \tilde{\tau}_i\right] \leq \frac{nm}{d}.$$

Algorithm 1 Repeated Halving (Version 1)
input: $m \times n$ matrix $A$, approximation parameter ɛ
output: spectral approximation $\tilde{A}$ consisting of $O(\epsilon^{-2} n \log n)$ rescaled rows of $A$
1: procedure RepeatedHalving
2:   Uniformly sample $m/2$ rows of $A$ to form $A'$
3:   If $A'$ has more than $O(n \log n)$ rows, recursively compute a $1/2$-spectral approximation $\tilde{A}'$ of $A'$
4:   Compute generalized leverage scores of $A$ w.r.t. $\tilde{A}'$
5:   Use these estimates to sample rows of $A$ to form $\tilde{A}$
6:   return $\tilde{A}$
7: end procedure

Remark 1.7. We are not addressing here how the approximation error enters from using $\tilde{A}'$ rather than $A'$ itself when computing $\tau_i^{\mathbf{T}A}$. But it looks like using $c$-factor overestimates will at most increase the sum by a factor $c$, because the transformation $x \mapsto 1/(1 + 1/x)$ is monotone and grows more slowly than $x$, so the effect of over-approximation is in fact dampened a bit.

Let's sketch the time required to do this:

1. Compute $\tilde{A}'$ by recursion. Time $O(\text{top level time})$, b/c $\log n$ levels but geometric decay in time at each level.

2. Compute $\tilde{A}'^\top \tilde{A}'$. Time $\tilde{O}(n^\omega)$.
3. Compute $(\tilde{A}'^\top \tilde{A}')^{-1}$. Time $\tilde{O}(n^\omega)$.
4. Compute $C = (\tilde{A}'^\top \tilde{A}')^{-1} A^\top$. Time $O(n^{\omega-1} m)$.
5. Compute $a_i^\top c_i$ for all $i$. Time $O(nm)$.

Overall, we get $O(n^{\omega-1} m)$ time to compute all the leverage scores, and given the leverage scores, we can compute $\tilde{A} = DA$ in time $O(\operatorname{nnz}(A)) \leq O(mn)$, which is a lower order term.

1.3 JL Speed-up

But we can go even faster, using the Johnson-Lindenstrauss transform [JL84] and a clever trick from [SS11].

Theorem 1.8 (Johnson-Lindenstrauss Lemma). Suppose $G \in \mathbb{R}^{r \times d}$ is a random matrix whose entries are i.i.d. $N(0,1)$, and $r \geq 8 \epsilon^{-2} \log(1/\delta_{JL})$. Then for any fixed vector $b \in \mathbb{R}^d$, with probability at least $1 - \delta_{JL}$,
$$(1-\epsilon)\,\|b\|_2^2 \;\leq\; \left\| \tfrac{1}{\sqrt{r}}\, G b \right\|_2^2 \;\leq\; (1+\epsilon)\,\|b\|_2^2.$$

It turns out we can use the JL transform to speed up the computation of leverage scores: Spielman and Srivastava realized that
$$a_i^\top (\tilde{A}'^\top \tilde{A}')^{-1} a_i = a_i^\top (\tilde{A}'^\top \tilde{A}')^{-1} \tilde{A}'^\top \tilde{A}' (\tilde{A}'^\top \tilde{A}')^{-1} a_i = \left\| \tilde{A}' (\tilde{A}'^\top \tilde{A}')^{-1} a_i \right\|_2^2 \approx \left\| \tfrac{1}{\sqrt{r}}\, G\, \tilde{A}' (\tilde{A}'^\top \tilde{A}')^{-1} a_i \right\|_2^2.$$
Now, we choose $\delta_{JL} = \delta/m$ and $\epsilon = 1/2$ for generating $G$ with $r = O(\log(m/\delta))$, s.t. with probability at least $1 - \delta$ we get estimates of all the leverage scores up to constant factors. Let's plug this into our previous algorithm:

Algorithm 2 Repeated Halving (Version 2)
input: $m \times n$ matrix $A$, approximation parameter ɛ
output: spectral approximation $\tilde{A}$ consisting of $O(\epsilon^{-2} n \log n)$ rescaled rows of $A$
1: procedure RepeatedHalving
2:   Uniformly sample $m/2$ rows of $A$ to form $A'$
3:   If $A'$ has more than $O(n \log n)$ rows, recursively compute a $1/2$-spectral approximation $\tilde{A}'$ of $A'$
4:   Compute approximate (JL-based) generalized leverage scores of $A$ w.r.t. $\tilde{A}'$
5:   Use these estimates to sample rows of $A$ to form $\tilde{A}$
6:   return $\tilde{A}$
7: end procedure
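To make step 4 concrete, here is a short NumPy sketch of the JL-based estimation: it forms $C = \frac{1}{\sqrt{r}} G \tilde{A}' (\tilde{A}'^\top \tilde{A}')^{-1}$ once and then estimates every score as $\|C a_i\|_2^2$. The function name, the uniformly subsampled stand-in for $\tilde{A}'$, and the rounded constant in $r$ (Theorem 1.8 with $\epsilon = 1/2$, $\delta_{JL} = \delta/m$) are illustrative choices, not the notes' reference implementation.

```python
import numpy as np

def approx_generalized_leverage_scores(A, A_tilde, r, rng):
    """JL-based estimates of tau_i^{A~}(A) = a_i^T (A~^T A~)^{-1} a_i
    via tau_i ~= || (1/sqrt(r)) G A~ (A~^T A~)^{-1} a_i ||^2."""
    G = rng.standard_normal((r, A_tilde.shape[0]))   # JL matrix with i.i.d. N(0,1) entries
    M_inv = np.linalg.inv(A_tilde.T @ A_tilde)       # (A~^T A~)^{-1}, an n x n matrix
    C = (G @ A_tilde) @ M_inv / np.sqrt(r)           # r x n; costs O(r nnz(A~) + r n^2)
    return np.sum((A @ C.T) ** 2, axis=1)            # row i gets ||C a_i||^2, O(r nnz(A))

# Usage: estimate generalized leverage scores of A w.r.t. a uniformly subsampled half.
rng = np.random.default_rng(0)
m, n, delta = 4000, 30, 0.1
A = rng.standard_normal((m, n))
A_half = A[rng.random(m) < 0.5]                      # stand-in for the recursive A~'
r = int(np.ceil(32 * np.log(m / delta)))             # 8 * eps^{-2} * log(1/delta_JL) with eps = 1/2
tau_hat = approx_generalized_leverage_scores(A, A_half, r, rng)
print(tau_hat.sum())                                 # roughly n, up to constant factors
```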

What's different? Let's again sketch the time required to do this:

1. Compute $\tilde{A}'$ by recursion. Time $O(\text{top level time})$, b/c $\log n$ levels but geometric decay in time at each level.
2. Compute $\tilde{A}'^\top \tilde{A}'$. Time $\tilde{O}(n^\omega)$.
3. Compute $(\tilde{A}'^\top \tilde{A}')^{-1}$. Time $\tilde{O}(n^\omega)$.
4. Compute $C = G \tilde{A}' (\tilde{A}'^\top \tilde{A}')^{-1}$. Time $O(r \cdot \operatorname{nnz}(\tilde{A}')) \leq O(r \cdot \operatorname{nnz}(A))$ for the $G \tilde{A}'$ product, plus $O(r n^2)$ time to multiply the result by $(\tilde{A}'^\top \tilde{A}')^{-1}$. The entire result is an $r \times n$ matrix.
5. Compute $C a_i$ for all $i$. Time $O(r \cdot \operatorname{nnz}(A))$.

The dominating terms are $\tilde{O}(\operatorname{nnz}(A) + n^\omega)$.

1.4 BSS-sparsification

The following theorem, due to Batson, Spielman, and Srivastava [BSST13], shows that in fact $O(n/\epsilon^2)$ rows suffice to produce an ɛ-sparsifier.

Theorem 1.9 (BSS Sparsifiers). Given $A$ and $\gamma \in (0,1)$, the algorithm BSSsparsify selects a set of $n/\gamma^2$ rescaled rows of $A$ to form $\tilde{A}$, which is a $(2\gamma + \gamma^2)$-approximation of $A$. The algorithm is deterministic and can be implemented to run in $O(\gamma^{-2} n^3 m)$ time.

Algorithm 3 BSS sparsification
input: $m \times n$ matrix $A$, approximation parameter $\gamma$
output: spectral approximation $\tilde{A}$ consisting of $n/\gamma^2$ rescaled rows of $A$
1: procedure BSSsparsify
2:   Compute $H = (A^\top A)^{-1/2}$
3:   For each $i \in [m]$, let $v_i = H a_i$.
4:   For convenience, let $d = 1/\gamma^2$ and $\delta_L = 1$
5:   Let $\epsilon_U = \frac{\sqrt{d}-1}{d+\sqrt{d}}$, $\epsilon_L = \frac{1}{\sqrt{d}}$, and $\delta_U = \frac{\sqrt{d}+1}{\sqrt{d}-1}$
6:   Let $u_0 \leftarrow n/\epsilon_U$, $l_0 \leftarrow -n/\epsilon_L$, $V_0 = 0$, and $S \leftarrow \emptyset$.
7:   for $t = 1$ to $dn$ do
8:     $u_t \leftarrow u_{t-1} + \delta_U$, $l_t \leftarrow l_{t-1} + \delta_L$
9:     $M_t \leftarrow (u_{t-1} I - V_{t-1})^{-1}$, $M_{t,+} \leftarrow (u_t I - V_{t-1})^{-1}$,
10:    $N_t \leftarrow (V_{t-1} - l_{t-1} I)^{-1}$, $N_{t,+} \leftarrow (V_{t-1} - l_t I)^{-1}$
11:    Find $i$ and $w_i$ s.t.
$$\frac{v_i^\top M_{t,+}^2 v_i}{\operatorname{Tr}(M_t) - \operatorname{Tr}(M_{t,+})} + v_i^\top M_{t,+} v_i \;\leq\; w_i \;\leq\; \frac{v_i^\top N_{t,+}^2 v_i}{\operatorname{Tr}(N_{t,+}) - \operatorname{Tr}(N_t)} - v_i^\top N_{t,+} v_i$$
12:    Let $V_t \leftarrow V_{t-1} + \frac{1}{w_i} v_i v_i^\top$, and $S \leftarrow S \cup \{i\}$.
13:  end for
14:  return
$$\tilde{A} \leftarrow \frac{1}{\sqrt{dn}} \begin{pmatrix} w_{i_1}^{-1/2} a_{i_1}^\top \\ \vdots \\ w_{i_T}^{-1/2} a_{i_T}^\top \end{pmatrix} \quad \text{where } S = \{i_1, \ldots, i_T\}.$$
15: end procedure
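To illustrate the barrier-function updates, here is a compact NumPy sketch of the selection loop in Algorithm 3 under the parameter choices above. The tie-breaking rule (take the index maximizing the gap between the lower- and upper-barrier quantities, and the midpoint weight) and the final $1/\sqrt{dn}$ rescaling are my own illustrative choices for this sketch; it is a straightforward translation, not an optimized or verified implementation.

```python
import numpy as np

def bss_sparsify(A, gamma):
    """Sketch of Algorithm 3 (BSS sparsification): deterministic barrier-based row selection."""
    m, n = A.shape
    d = 1.0 / gamma**2
    sd = np.sqrt(d)
    eps_U = (sd - 1) / (d + sd)
    eps_L = 1.0 / sd
    delta_U = (sd + 1) / (sd - 1)
    delta_L = 1.0

    # Isotropic positions v_i = H a_i with H = (A^T A)^{-1/2}.
    evals, evecs = np.linalg.eigh(A.T @ A)
    H = evecs @ np.diag(evals**-0.5) @ evecs.T
    VA = A @ H                                  # row i of VA is v_i^T

    V = np.zeros((n, n))
    u, ell = n / eps_U, -n / eps_L
    I = np.eye(n)
    rows, weights = [], []

    for _ in range(int(round(d * n))):
        u, ell = u + delta_U, ell + delta_L
        M  = np.linalg.inv((u - delta_U) * I - V)    # (u_{t-1} I - V_{t-1})^{-1}
        Mp = np.linalg.inv(u * I - V)                # (u_t I - V_{t-1})^{-1}
        N  = np.linalg.inv(V - (ell - delta_L) * I)  # (V_{t-1} - l_{t-1} I)^{-1}
        Np = np.linalg.inv(V - ell * I)              # (V_{t-1} - l_t I)^{-1}

        # Upper- and lower-barrier quantities from line 11, for every row at once.
        up = np.einsum('ij,jk,ik->i', VA, Mp @ Mp, VA) / (np.trace(M) - np.trace(Mp)) \
             + np.einsum('ij,jk,ik->i', VA, Mp, VA)
        lo = np.einsum('ij,jk,ik->i', VA, Np @ Np, VA) / (np.trace(Np) - np.trace(N)) \
             - np.einsum('ij,jk,ik->i', VA, Np, VA)

        i = int(np.argmax(lo - up))       # some i with up[i] <= lo[i] is guaranteed to exist
        w = 0.5 * (up[i] + lo[i])         # any w_i with up[i] <= w_i <= lo[i] works
        V += np.outer(VA[i], VA[i]) / w
        rows.append(i)
        weights.append(w)

    W = np.asarray(weights)
    return A[rows] / np.sqrt(W)[:, None] / np.sqrt(d * n)

# Usage: gamma = 0.5 keeps n / gamma^2 = 40 rescaled rows of a 500 x 10 matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 10))
print(bss_sparsify(A, gamma=0.5).shape)
```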

References

[AW02] Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.

[BSST13] Joshua Batson, Daniel A. Spielman, Nikhil Srivastava, and Shang-Hua Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 56(8):87–94, 2013.

[CLM+15] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS '15, pages 181–190, New York, NY, USA, 2015. ACM.

[JL84] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.

[Rud99] Mark Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60–72, 1999.

[SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.

[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.