Randomized algorithms for the approximation of matrices

Randomized algorithms for the approximation of matrices. Luis Rademacher, The Ohio State University, Computer Science and Engineering (joint work with Amit Deshpande, Santosh Vempala, Grant Wang).

Two topics: (1) Low-rank matrix approximation (PCA). (2) Subset selection: approximate a matrix using another matrix whose columns lie in the span of a few columns of the original matrix.

Motivating example: DNA microarray [Drineas, Mahoney]. Unsupervised feature selection for classification. Data: table of gene expressions (features) vs. patients. Categories: cancer types. Feature selection criterion: leverage scores (importance of a given feature in determining the top principal components). Empirically: leverage scores are correlated with information gain, a supervised measure of influence. Somewhat unexpected. Leads to clear separation (clusters) from the selected features.

In matrix form: A is an m × n matrix, m patients, n genes (features); find A ≈ CX, where the columns of C are a few columns of A (so X = C⁺A, with C⁺ the pseudoinverse). They prove error bounds when the columns of C are selected at random according to leverage scores (importance sampling).
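A minimal NumPy sketch of this kind of leverage-score column selection; the function name, the choice to sample c columns with replacement, and the other details are illustrative assumptions, not the exact procedure of [Drineas, Mahoney]:

```python
import numpy as np

def leverage_score_column_selection(A, k, c, rng=np.random.default_rng(0)):
    """Pick c columns of A with probability proportional to rank-k leverage scores."""
    # Top-k right singular vectors of A (each column of A has one leverage score).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                       # k x n
    scores = np.sum(Vk**2, axis=0) / k   # leverage scores, normalized to sum to 1
    cols = rng.choice(A.shape[1], size=c, replace=True, p=scores)
    C = A[:, cols]                       # selected columns
    X = np.linalg.pinv(C) @ A            # X = C^+ A, so A is approximated by C X
    return C, X, cols
```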

(P1) Matrix approximation. Given an m-by-n matrix A, find a low-rank approximation for some norm: ‖A‖_F² = Σ_{ij} A_{ij}² (Frobenius); ‖A‖_2 = σ_max(A) = max_x ‖Ax‖/‖x‖ (spectral).

Geometric view: given points in R^n, find a subspace close to them. Error: the Frobenius norm corresponds to the sum of squared distances.

Classical solution. Best rank-k approximation A_k in both ‖·‖_F² and ‖·‖_2²: top k terms of the singular value decomposition (SVD): if A = Σ_i σ_i u_i v_i^T, then A_k = Σ_{i=1}^k σ_i u_i v_i^T. Best k-dim. subspace: rowspan(A_k), i.e., the span of the top k eigenvectors of A^T A. Leads to an iterative algorithm, essentially in time mn².
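For reference, a minimal NumPy sketch of this classical solution via truncated SVD (the baseline that the randomized methods below try to beat in running time):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation A_k = sum_{i<=k} sigma_i u_i v_i^T (optimal in Frobenius and spectral norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

A = np.random.default_rng(1).standard_normal((200, 50))
A2 = best_rank_k(A, 2)
print(np.linalg.matrix_rank(A2), np.linalg.norm(A - A2))
```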

Want an algorithm with a better error/time trade-off, efficient for very large data: nearly linear time; pass efficient (if the data does not fit in main memory, the algorithm should not need random access, only a few sequential passes); subspace equal to or contained in the span of a few rows (actual rows are more informative than arbitrary linear combinations).

Idea [Frieze Kannan Vempala]: sampling rows. Uniform sampling does not work (e.g., a single non-zero entry). Sample by importance: sample s rows, each independently with probability proportional to its squared length.

[FKV] Theorem 1. Let S be a sample of k/ε rows where P(row i is picked) ∝ ‖A_i‖². Then the span of S contains the rows of a matrix Ã of rank k for which E(‖A − Ã‖_F²) ≤ ‖A − A_k‖_F² + ε‖A‖_F². This can be turned into an efficient algorithm: 2 passes, complexity O(kmn/ε) (to compute Ã, do an SVD in the span of S, which is fast because n effectively becomes k/ε).
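A hedged NumPy sketch in the spirit of the [FKV] length-squared sampling (not their exact two-pass implementation): sample s rows with probability proportional to squared length, rescale, and project A onto the top-k right singular vectors of the sample, which lie in the span of the sampled rows:

```python
import numpy as np

def fkv_approximation(A, k, s, rng=np.random.default_rng(0)):
    """Additive-error rank-k approximation from s length-squared-sampled rows."""
    row_norms2 = np.einsum('ij,ij->i', A, A)
    p = row_norms2 / row_norms2.sum()
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    # Rescale the sampled rows so that S^T S is an unbiased estimator of A^T A.
    S = A[idx] / np.sqrt(s * p[idx])[:, None]
    # Top-k right singular vectors of the small s x n matrix S.
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    Vk = Vt[:k, :]
    # Rank-k approximation: project the rows of A onto span(Vk) (contained in rowspan(S)).
    return (A @ Vk.T) @ Vk
```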

One drawback of [FKV]: the additive error can be large (say, if the matrix is nearly low rank). Prefer relative error, something like ‖A − Ã‖_F² ≤ (1 + ε)‖A − A_k‖_F².

Several ways: [Har-Peled 06] (first linear time relative approximation). [Sarlos 06]: random projection of the rows onto an O(k/ε)-dim. subspace, then SVD. [Deshpande R Vempala Wang 06], [Deshpande Vempala 06]: volume sampling (rough approximation) + adaptive sampling.

Some more relevant work. [Papadimitriou Raghavan Tamaki Vempala 98]: introduced random projection for matrix approximation. [Achlioptas McSherry 01], [Clarkson Woodruff 09]: one-pass algorithms. [Woolfe Liberty Rokhlin Tygert 08], [Rokhlin Szlam Tygert 09]: random projection + power iteration, giving very fast practical algorithms. See the survey [Halko Martinsson Tropp 09]. Also work by d'Aspremont, Drineas, Ipsen, Mahoney, Muthukrishnan, ...

(P2) Algorithmic problems: volume sampling and subset selection. Given an m-by-n matrix, pick a set of k rows at random with probability proportional to the squared volume of the k-simplex spanned by them and the origin [DRVW] (equivalently, the squared volume of the parallelepiped determined by them).

Volume sampling. Let S be a k-subset of rows of A. Then [k! vol(conv(0, A_S))]² = vol(□(A_S))² = det(A_S A_S^T) (*). So volume sampling for A is equivalent to: pick a k-by-k principal minor S × S of A A^T with probability proportional to det(A_S A_S^T). For (*): complete A_S to a square matrix B by adding orthonormal rows orthogonal to span(A_S). Then vol(□(A_S))² = (det B)² = det(B B^T) = det( [ A_S A_S^T, 0 ; 0, I ] ) = det(A_S A_S^T).
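For intuition, a brute-force NumPy sketch of this definition: enumerate all k-subsets and sample with probability proportional to det(A_S A_S^T). This is exponential in k and only meant to make the distribution concrete on small matrices:

```python
import numpy as np
from itertools import combinations

def volume_sample_bruteforce(A, k, rng=np.random.default_rng(0)):
    """Sample a k-subset S of rows with P(S) proportional to det(A_S A_S^T)."""
    subsets = list(combinations(range(A.shape[0]), k))
    vols = np.array([np.linalg.det(A[list(S)] @ A[list(S)].T) for S in subsets])
    vols = np.clip(vols, 0.0, None)   # Gram determinants are >= 0; guard against round-off
    probs = vols / vols.sum()
    return list(subsets[rng.choice(len(subsets), p=probs)])
```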

Original motivation: relative error low rank matrix approximation [DRVW]. S: k-subset of rows according to volume sampling. A_k: best rank-k approximation, given by principal components (SVD). π_S: projection of the rows onto rowspan(A_S). Then E_S(‖A − π_S(A)‖_F²) ≤ (k + 1)‖A − A_k‖_F². The factor k+1 is best possible [DRVW]. Interesting existential result (there exist k rows ...). Algorithm? Led to a linear time, pass-efficient algorithm for a relative (1+ε) approximation of A_k in the span of O*(k/ε) rows [DV].

Where does volume sampling come from? "No self-respecting architect leaves the scaffolding in place after completing the building." Gauss?

Where does volume sampling come from? Idea: for picking k out of k + 1 points, the k with maximum volume are optimal. For picking 1 out of m, random according to squared length is better than maximum length. For k out of m, this suggests volume sampling.

Where does volume sampling come from? Why does the algebra work? Idea: when picking 1 out of m at random according to squared length, the expected error is a sum of squares of areas of triangles: E[error] = Σ_s ( ‖A_s‖² / Σ_t ‖A_t‖² ) Σ_i d(A_i, span(A_s))². This sum corresponds to a certain coefficient of the characteristic polynomial of AA^T.

Later motivation [BDM]: (row/column) subset selection, a refinement of principal component analysis. Given a matrix A, PCA: find the k-dim subspace V that minimizes ‖A − π_V(A)‖_F² (π_V projects the rows). Subset selection: find V spanned by k rows of A. Seemingly harder, combinatorial flavor.

Why subset selection? PCA is unsatisfactory: the top components are linear combinations of rows (all rows, generically). Many applications prefer individual, most relevant rows, e.g.: feature selection in machine learning; linear regression using only the most relevant independent variables; out of thousands of genes, find a few that explain a disease.

Known results [Deshpande-Vempala]: polytime k!-approximation to volume sampling, by adaptive sampling: pick a row with probability proportional to squared length; project all rows orthogonal to it; repeat. Implies, for a random k-subset S with that distribution: E_S(‖A − π_S(A)‖_F²) ≤ (k + 1)! ‖A − A_k‖_F².
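A short NumPy sketch of the adaptive sampling procedure just described; the assumption rank(A) ≥ k and the normalization details are mine:

```python
import numpy as np

def adaptive_sample(A, k, rng=np.random.default_rng(0)):
    """k!-approximation to volume sampling via adaptive squared-length sampling."""
    R = A.astype(float)        # residual rows; assumes rank(A) >= k
    picked = []
    for _ in range(k):
        norms2 = np.einsum('ij,ij->i', R, R)
        p = norms2 / norms2.sum()
        i = rng.choice(len(p), p=p)        # pick a row ~ squared length of residual
        picked.append(i)
        v = R[i] / np.linalg.norm(R[i])
        R = R - np.outer(R @ v, v)         # project all rows orthogonal to the picked row
    return picked
```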

Known results. [Boutsidis, Drineas, Mahoney]: polytime randomized algorithm to find a k-subset S with ‖A − π_S(A)‖_F² ≤ O(k² log k) ‖A − A_k‖_F². [Gu-Eisenstat]: deterministic algorithm, ‖A − π_S(A)‖_2² ≤ (1 + f² k(n − k)) ‖A − A_k‖_2², in time O((m + n log_f n) n²). Spectral norm: ‖A‖_2 = sup_{x ∈ R^n} ‖Ax‖/‖x‖.

Known results. Remember, volume sampling is equivalent to sampling a k-by-k minor S × S of AA^T with probability proportional to det(A_S A_S^T) (*). [Goreinov, Tyrtishnikov, Zamarashkin]: maximizing (*) over S is good for subset selection. [Çivril, Magdon-Ismail] [see also Koutis 06]: but maximizing is NP-hard, even approximately to within an exponential factor.

Results. Volume sampling: polytime exact algorithm, O(kmn^ω log n) arithmetic ops., where ω is the exponent of matrix multiplication (some ideas earlier in [Hough Krishnapur Peres Virág]). Implies an algorithm with optimal approximation for subset selection under the Frobenius norm. Can be derandomized by the method of conditional expectations. Also, (1+ε)-approximations to the previous two algorithms in nearly linear time, using a volume-preserving random projection [Magen Zouzias].

Results. Observation: a bound in Frobenius norm easily implies a bound in spectral norm: ‖A − π_S(A)‖_2² ≤ ‖A − π_S(A)‖_F² ≤ (k + 1)‖A − A_k‖_F² ≤ (k + 1)(n − k)‖A − A_k‖_2², using ‖A‖_F² = Σ_i σ_i² and ‖A‖_2² = σ_max², where σ_max = σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0 are the singular values of A.

Comparison for subset selection: find S s.t. ‖A − π_S(A)‖² ≤ (factor) · ‖A − A_k‖², in squared Frobenius and squared spectral norm. Time assumes m > n; ω is the exponent of matrix multiplication; D = deterministic, R = randomized.
[D R V W]: Frobenius factor k+1; existential.
[Deshpande Vempala]: Frobenius factor (k+1)!; time kmn; R.
[Gu Eisenstat]: spectral factor 1 + k(n−k); existential.
[Gu Eisenstat]: spectral factor 1 + f² k(n−k); time (m + n log_f n)n²; D.
[Boutsidis Drineas Mahoney]: Frobenius factor k² log k; spectral factor k²(n−k) log k (Frobenius implies spectral); time mn²; R.
[Deshpande R]: Frobenius factor k+1 (optimal); spectral factor (k+1)(n−k); time kmn^ω log n; D.
[Deshpande R]: Frobenius factor (1+ε)(k+1); spectral factor (1+ε)(k+1)(n−k); time O*(mnk²/ε² + mk^{2ω+1}/ε^{2ω}); R.

Proofs: volume sampling. Want (w.l.o.g.) a k-tuple S of rows of the m by n matrix A with probability det(A_S A_S^T) / Σ_{S' ∈ [m]^k} det(A_{S'} A_{S'}^T).

Proofs: volume sampling. Idea: pick S = (S_1, S_2, ..., S_k) in sequence. Need the marginal distribution of S_1 to begin: P(S_1 = i) = (sum over tuples with S_1 = i) / (sum over all tuples) = Σ_{S' ∈ [m]^k, S'_1 = i} det(A_{S'} A_{S'}^T) / Σ_{S' ∈ [m]^k} det(A_{S'} A_{S'}^T). Remember the characteristic polynomial: p_{AA^T}(x) = det(xI − AA^T) = Σ_i c_i(AA^T) x^i, and |c_{m−k}(AA^T)| = Σ_{S ⊆ [m], |S| = k} det(A_S A_S^T).

Proofs: volume sampling. So, with C_i = A − π_{A_i}(A):
P(S_1 = i) = Σ_{S' ∈ [m]^k, S'_1 = i} det(A_{S'} A_{S'}^T) / Σ_{S' ∈ [m]^k} det(A_{S'} A_{S'}^T)
= [ (k−1)! ‖A_i‖² Σ_{S' ⊆ [m], |S'| = k−1} det((C_i)_{S'} (C_i)_{S'}^T) ] / [ k! Σ_{S' ⊆ [m], |S'| = k} det(A_{S'} A_{S'}^T) ]
= ‖A_i‖² |c_{m−k+1}(C_i C_i^T)| / ( k |c_{m−k}(AA^T)| ).
Intuition for the numerator: vol(□(A_1, A_2, A_3)) = ‖A_1‖ · vol(□(π_{A_1⊥}(A_2, A_3))).

Proofs: volume sampling. So: P(S_1 = i) = ‖A_i‖² |c_{m−k+1}(C_i C_i^T)| / ( k |c_{m−k}(AA^T)| ), with C_i = A − π_{A_i}(A). This can be computed in polytime. After S_1, project the rows orthogonal to the picked row and repeat the marginal computation for S_2, ... (use the intuition for the numerator). Flops: k · m · (m²n + m^ω log m).
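An unoptimized NumPy sketch of this marginal formula, using numpy.poly to read off characteristic-polynomial coefficients; it is only meant to check the formula on small matrices with nonzero rows, and makes no attempt at the flop counts above:

```python
import numpy as np

def char_coeff_abs(M, j):
    """|coefficient of x^j| in det(xI - M); np.poly lists coefficients from highest degree down."""
    c = np.poly(M)            # length m+1, c[0] is the x^m coefficient
    m = M.shape[0]
    return abs(c[m - j])

def marginal_S1(A, k):
    """P(S_1 = i) for exact volume sampling of k rows, via the formula on this slide."""
    m = A.shape[0]
    denom = k * char_coeff_abs(A @ A.T, m - k)
    probs = np.empty(m)
    for i in range(m):
        v = A[i] / np.linalg.norm(A[i])
        C = A - np.outer(A @ v, v)                 # C_i = A - pi_{A_i}(A)
        probs[i] = np.dot(A[i], A[i]) * char_coeff_abs(C @ C.T, m - k + 1) / denom
    return probs                                   # should sum to 1
```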

Proofs: volume sampling. Faster (we assume m > n): use p_{AA^T}(x) = x^{m−n} p_{A^T A}(x), as A^T A is n by n (smaller than m by m), and use rank-1 updates: with C_i = A − (1/‖A_i‖²) A A_i A_i^T,
C_i^T C_i = A^T A − (A^T A A_i A_i^T)/‖A_i‖² − (A_i A_i^T A^T A)/‖A_i‖² + (A_i A_i^T A^T A A_i A_i^T)/‖A_i‖⁴.
Total flops: mn² + km(n² + n^ω log n) = O(kmn^ω log n).

Even faster. Volume sampling only cares about volumes of k-subsets, so one can get a (1+ε) approximation using a volume-preserving random projection [Magen, Zouzias] (generalizing Johnson-Lindenstrauss, not the same as Feige's notion).

Even faster [Magen Zouzias]: for any A ∈ R^{m×n}, 1 ≤ k ≤ n, ε < 1/2, there is a d = O(k² ε^{-2} log m) s.t. for all k-subsets S of [m]: det(A_S A_S^T) ≤ det(Ã_S Ã_S^T) ≤ (1 + ε) det(A_S A_S^T), where Ã = AG and G ∈ R^{n×d} is a scaled random Gaussian matrix (k = 1 is Johnson-Lindenstrauss). This as preprocessing implies (1+ε) volume sampling in time O*(mnk²/ε² + mk^{2ω+1}/ε^{2ω}).
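A hedged NumPy sketch of this preprocessing step; the constant c in the target dimension d is an assumption (the theorem only gives d = O(k² ε^{-2} log m)), and 1/√d is the usual Johnson-Lindenstrauss-style scaling:

```python
import numpy as np

def volume_preserving_projection(A, k, eps, c=1.0, rng=np.random.default_rng(0)):
    """Reduce dimension with a scaled Gaussian map G so that k-subset volumes are preserved up to 1+eps."""
    m, n = A.shape
    d = min(n, int(np.ceil(c * k**2 * eps**-2 * np.log(m))))
    G = rng.standard_normal((n, d)) / np.sqrt(d)   # A_tilde = A G, as in [Magen Zouzias]
    return A @ G   # volume sampling can then be run on the smaller m x d matrix
```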

Recent news. [Boutsidis, Drineas, Magdon-Ismail (FOCS 11)]: improved subset selection using [BSS]. [Guruswami, Sinop (SODA 12)]: volume sampling in time O(kmn²); relative (1 + ε) matrix approximation with one round of volume sampling of r = k − 1 + k/ε rows. More precisely, for S a sample of size r ≥ k according to volume sampling: E_S(‖A − π_S(A)‖_F²) ≤ (r + 1)/(r + 1 − k) · ‖A − A_k‖_F².

Open question. For a subset of rows S according to volume sampling, lower bound (in expectation?) σ_min(A_S). Want A_S to be well conditioned.