0368-3248-01 Algorithms in Data Mining, Fall 2013
Lecturer: Edo Liberty

Lecture 9: Matrix approximation continued

Warning: This note may contain typos and other inaccuracies which are usually discussed during class. Please do not cite this note as a reliable source. If you find mistakes, please inform me.

Recap

Last class was dedicated to approximating matrices by sampling individual entries from them. We used the following useful concentration result for sums of random matrices.

Lemma 0.1 (Matrix Bernstein Inequality [1]). Let $X_1, \ldots, X_s$ be independent $m \times n$ matrix valued random variables such that $\forall k \in [s]$, $E[X_k] = 0$ and $\|X_k\| \le R$. Set $\sigma^2 = \max\{\|\sum_k E[X_k X_k^T]\|, \|\sum_k E[X_k^T X_k]\|\}$. Then
$$\Pr\Big[\Big\|\sum_k X_k\Big\| > t\Big] \le (m+n)\, e^{-\frac{t^2}{\sigma^2 + Rt/3}}.$$

We obtained an approximate sampled matrix $B$ which was sparser than $A$. For example, for a matrix $A$ containing values in $\{-1, 0, 1\}$ it sufficed that $B$ contains only $s$ entries with
$$s \in O\!\left(\frac{nr\log(n/\delta)}{\varepsilon^2}\right).$$
Moreover, we got that $\|A - B\| \le \varepsilon\|A\|$. We also claimed that we can compute the PCA projection of $B$, $P_k^B$, instead of that of $A$, $P_k$, and still have
$$\sigma_{k+1} \le \|A - P_k A\| \le \|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|.$$
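As a quick numerical illustration of the recap, here is a minimal NumPy sketch of one plausible entry-sampling scheme: keep each entry independently with probability $p$ and rescale by $1/p$, so that $E[B] = A$. The exact scheme and constants from last class may differ; this snippet only checks $\|A - B\|$ against $\varepsilon\|A\|$ on a toy matrix.

```python
import numpy as np

# Minimal sketch (assumed scheme, not necessarily the one from last class):
# keep each entry of A independently with probability p and rescale by 1/p,
# so that E[B] = A while B has roughly p*m*n nonzeros.
rng = np.random.default_rng(0)
m, n, p = 300, 400, 0.1

A = rng.choice([-1.0, 0.0, 1.0], size=(m, n))
mask = rng.random((m, n)) < p
B = np.where(mask, A / p, 0.0)          # unbiased estimator of A

spec = lambda M: np.linalg.norm(M, 2)   # spectral norm
print("||A - B||       =", spec(A - B))
print("||A||           =", spec(A))
print("relative error  =", spec(A - B) / spec(A))
```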

In this class we will reduce the amount of space needed to be independent of $n$. We will do this by approximating $AA^T$ directly. Here we follow the ideas in [2] and give a much simpler proof which, unfortunately, obtains slightly worse bounds.

PCA with an approximate covariance matrix

Here we see that approximating $BB^T \approx AA^T$ is sufficient in order to compute an approximate PCA projection (Lemma 3.8 in [2]). Let $P_k^B$ denote the projection on the top $k$ left singular vectors of $B$.
$$\begin{aligned}
\|A - P_k^B A\|^2 &= \sup_{\|x\|=1} \|x^T A - x^T P_k^B A\|^2 & (1)\\
&= \sup_{x \in \mathrm{null}(P_k^B),\ \|x\|=1} \|x^T A\|^2 & (2)\\
&= \sup_{x \in \mathrm{null}(P_k^B),\ \|x\|=1} x^T AA^T x & (3)\\
&= \sup_{x \in \mathrm{null}(P_k^B),\ \|x\|=1} x^T(AA^T - BB^T)x + x^T BB^T x & (4)\\
&\le \|AA^T - BB^T\| + \sigma_{k+1}(BB^T) & (5)\\
&\le 2\|AA^T - BB^T\| + \sigma_{k+1}(AA^T) & (6)
\end{aligned}$$
The last transition is due to the fact that $\sigma_{k+1}(BB^T) \le \sigma_{k+1}(AA^T) + \|AA^T - BB^T\|$. By taking the square root and recalling that $\sigma_{k+1}(AA^T) = \sigma_{k+1}^2(A) = \sigma_{k+1}^2$ we get
$$\|A - P_k^B A\| \le \sqrt{\sigma_{k+1}^2 + 2\|AA^T - BB^T\|} \le \sigma_{k+1} + \sqrt{2\|AA^T - BB^T\|}.$$
Therefore, we have that
$$\|AA^T - BB^T\| \le \varepsilon^2\|A\|^2/2 \;\Longrightarrow\; \|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|.$$
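The bound above says that any sketch $B$ with $BB^T \approx AA^T$ can stand in for $A$ when computing the rank-$k$ projection. A minimal NumPy sketch of this step follows; the function name and the noisy stand-in sketch are illustrative only, not taken from the notes.

```python
import numpy as np

def pca_from_sketch(A, B, k):
    """Return sigma_{k+1}(A), ||A - P_k A|| and ||A - P_k^B A||, where
    P_k^B projects onto the top-k left singular vectors of the sketch B."""
    Ua, sa, _ = np.linalg.svd(A, full_matrices=False)
    Ub, _, _ = np.linalg.svd(B, full_matrices=False)
    Pk  = Ua[:, :k] @ Ua[:, :k].T          # exact rank-k projection for A
    PkB = Ub[:, :k] @ Ub[:, :k].T          # projection computed from B only
    spec = lambda M: np.linalg.norm(M, 2)  # spectral norm
    return sa[k], spec(A - Pk @ A), spec(A - PkB @ A)

# toy check with a stand-in sketch: any B with BB^T close to AA^T will do
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 2000))
B = A + 0.01 * rng.standard_normal((100, 2000))
print(pca_from_sketch(A, B, k=5))
```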

Column subset selection by sampling

From this point on, our goal is to find a matrix $B$ such that $\|AA^T - BB^T\|$ is small and $B$ is as sparse or small as possible. Note that $AA^T = \sum_j A_j A_j^T$, where we denote by $A_j$ the $j$'th column of the matrix $A$. Let
$$B = A_j/\sqrt{p_j} \quad \text{with probability } p_j.$$
Computing the expectation of $BB^T$ we have
$$E[BB^T] = \sum_j p_j (A_j/\sqrt{p_j})(A_j/\sqrt{p_j})^T = \sum_j A_j A_j^T = AA^T.$$
Clearly, we cannot hope to approximate the matrix $A$ with only one column. We therefore define $B$ to be $s$ such sampled columns from $A$ placed side by side,
$$B = \frac{1}{\sqrt{s}}[B_1 \ldots B_s].$$
Just to clarify, $B$ is an $m \times s$ matrix containing $s$ rescaled columns of $A$, and $B_k = A_j/\sqrt{p_j}$ with probability $p_j$. Computing the expectation of $BB^T$ we get that
$$E[BB^T] = E\Big[\frac{1}{s}\sum_{k=1}^s B_k B_k^T\Big] = \frac{1}{s}\sum_{k=1}^s E[B_k B_k^T] = AA^T.$$
We are now ready to use the matrix Bernstein inequality above:
$$BB^T - AA^T = \sum_{k=1}^s \frac{1}{s}(B_k B_k^T - AA^T) = \sum_{k=1}^s X_k.$$
To make things simpler we pick $p_j = \|A_j\|^2/\|A\|_F^2$. In words, the columns are picked with probability proportional to their squared norm. Then
$$R = \max_k \|X_k\| \le \max_{j \in [n]} \frac{1}{s}\|A_j A_j^T\|/p_j + \frac{1}{s}\|AA^T\| = \frac{1}{s}\|A\|_F^2 + \frac{1}{s}\|A\|^2,$$
$$\sigma^2 = \Big\|\sum_{k=1}^s E[X_k X_k^T]\Big\| = \frac{1}{s}\big\|E[B_k B_k^T B_k B_k^T] - AA^T AA^T\big\| \qquad (7)$$
$$\le \frac{1}{s}\Big\|\sum_{j} p_j \frac{A_j A_j^T A_j A_j^T}{p_j^2}\Big\| + \frac{1}{s}\|A\|^4 = \frac{1}{s}\|A\|_F^2\|A\|^2 + \frac{1}{s}\|A\|^4. \qquad (8)$$
Plugging both expressions into the matrix Bernstein bound above we get that
$$\Pr\big[\|BB^T - AA^T\| > \varepsilon^2\|A\|^2/2\big] \le 2m\, e^{-\frac{s\varepsilon^4\|A\|^4/4}{\|A\|_F^2\|A\|^2 + \|A\|^4 + \varepsilon^2\|A\|^2\|A\|_F^2/3 + \varepsilon^2\|A\|^4/3}}.$$
By using that $\|A\|_F \ge \|A\|$ and that $\varepsilon \le 1$, and denoting the numeric rank of $A$ by $r = \|A\|_F^2/\|A\|^2$, we can simplify this to
$$\Pr\big[\|BB^T - AA^T\| > \varepsilon^2\|A\|^2/2\big] \le 2m\, e^{-\frac{s\varepsilon^4}{16 r}}.$$
To conclude, it is enough to sample
$$s = \frac{16\, r \log(2m/\delta)}{\varepsilon^4}$$
columns from $A$ (with probability proportional to their squared norm) to form a matrix $B$ such that $\|BB^T - AA^T\| \le \varepsilon^2\|A\|^2/2$ with probability at least $1-\delta$. According to the above this also gives us that $\|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|$, which completes our claim. Note that the number of non-zeros in $B$ is bounded by $s \cdot m$, which is independent of $n$, the number of columns. This is potentially significantly better than the results obtained in last week's class.

Remark 0.1. More elaborate column selection algorithms exist which provide better approximation but these will not be discussed here. See [3] for the latest result that I am aware of.
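For concreteness, a minimal NumPy sketch of the sampling scheme just described: $s$ columns are drawn i.i.d. with $p_j \propto \|A_j\|^2$ and each is rescaled by $1/\sqrt{s\, p_j}$, so that $E[BB^T] = AA^T$. The helper name and toy dimensions are illustrative only.

```python
import numpy as np

def sample_columns(A, s, rng):
    """Sample s columns of A i.i.d. with p_j proportional to ||A_j||^2
    and rescale them so that E[B B^T] = A A^T."""
    p = np.sum(A * A, axis=0) / np.linalg.norm(A, 'fro') ** 2
    idx = rng.choice(A.shape[1], size=s, p=p)
    return A[:, idx] / np.sqrt(s * p[idx])   # column k is A_j / sqrt(s * p_j)

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 5000)) @ np.diag(rng.random(5000))
B = sample_columns(A, s=400, rng=rng)
err = np.linalg.norm(A @ A.T - B @ B.T, 2)
print("||AA^T - BB^T|| / ||A||^2 =", err / np.linalg.norm(A, 2) ** 2)
```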

Deterministic Lossy SVD

In this section we will see that the above approximation can be improved using a simple trick [4]. The algorithm keeps an $m \times s$ sketch matrix $B$ which is updated every time a new column of the input matrix $A$ arrives. It maintains the invariant that the last column of the sketch is always all-zeros. When a new input column arrives it is placed in the last (all-zeros) column of the sketch. Then, using its SVD, the sketch is rotated from the right so that its columns are orthogonal. Finally, the sketch column norms are shrunk so that the last column is again all-zeros. In the algorithm we denote by $[U, \Sigma, V] = SVD(B)$ the Singular Value Decomposition of $B$. We use the convention that $U\Sigma V^T = B$, $U^T U = V^T V = I$, and $\Sigma = \mathrm{diag}([\sigma_1, \ldots, \sigma_s])$, $\sigma_1 \ge \ldots \ge \sigma_s$. The notation $I$ stands for the $s \times s$ identity matrix while $B_s$ denotes the $s$'th column of $B$ (similarly $A_j$).

Input: $s$, $A \in \mathbb{R}^{m \times n}$
  $B^0 \leftarrow$ all-zeros matrix in $\mathbb{R}^{m \times s}$
  for $i \in [n]$ do
    $C^i \leftarrow B^{i-1}$
    $C^i_s \leftarrow A_i$        (put $A_i$ in the last, all-zeros, column of $C^i$)
    $[U^i, \Sigma^i, V^i] \leftarrow SVD(C^i)$
    $D^i \leftarrow U^i \Sigma^i$
    $\delta_i \leftarrow (\Sigma^i_{s,s})^2$
    $W^i \leftarrow \sqrt{I - \delta_i (\Sigma^i)^{-2}}$
    $B^i \leftarrow D^i W^i$
  end for
  Return: $B = B^n$

Lemma 0.2. Let $B$ be the output of the above algorithm for a matrix $A$ and an integer $s$. Then:
$$\|AA^T - BB^T\| \le \|A\|_F^2/s.$$

Proof. We start by bounding $\|AA^T - BB^T\|$ from above:
$$\|AA^T - BB^T\| = \max_{\|x\|=1}\big(x^T AA^T x - x^T BB^T x\big) = \max_{\|x\|=1}\big(\|x^T A\|^2 - \|x^T B\|^2\big).$$
We open this with the simple telescopic sum $\|x^T B\|^2 = \sum_{i=1}^n \|x^T B^i\|^2 - \|x^T B^{i-1}\|^2$ and by replacing $\|x^T A\|^2 = \sum_{i=1}^n (x^T A_i)^2$:
$$\begin{aligned}
\|x^T A\|^2 - \|x^T B\|^2 &= \sum_{i=1}^n \big[(x^T A_i)^2 - (\|x^T B^i\|^2 - \|x^T B^{i-1}\|^2)\big] & (9)\\
&= \sum_{i=1}^n \big[\big((x^T A_i)^2 + \|x^T B^{i-1}\|^2\big) - \|x^T B^i\|^2\big] & (10)\\
&= \sum_{i=1}^n \big[\|x^T C^i\|^2 - \|x^T B^i\|^2\big] & (11)\\
&= \sum_{i=1}^n \big[\|x^T D^i\|^2 - \|x^T D^i W^i\|^2\big] & (12)\\
&= \sum_{i=1}^n x^T\big[D^i (D^i)^T - D^i (W^i)^2 (D^i)^T\big]x & (13)\\
&= \sum_{i=1}^n x^T\big[\delta_i D^i (\Sigma^i)^{-2} (D^i)^T\big]x & (14)\\
&= \sum_{i=1}^n x^T\big[\delta_i U^i (U^i)^T\big]x & (15)\\
&\le \sum_{i=1}^n \delta_i \|U^i (U^i)^T\| = \sum_{i=1}^n \delta_i & (16)
\end{aligned}$$
For this to mean anything we have to bound the term $\sum_{i=1}^n \delta_i$. We do this by computing the Frobenius norm of $B$:
$$\begin{aligned}
\|B^n\|_F^2 &= \sum_{i=1}^n \|B^i\|_F^2 - \|B^{i-1}\|_F^2 & (17)\\
&= \sum_{i=1}^n \big[\|B^i\|_F^2 - \|D^i\|_F^2\big] + \big[\|D^i\|_F^2 - \|C^i\|_F^2\big] + \big[\|C^i\|_F^2 - \|B^{i-1}\|_F^2\big] & (18)
\end{aligned}$$
Let us deal with each term separately:
$$\begin{aligned}
\|B^i\|_F^2 - \|D^i\|_F^2 &= \mathrm{tr}\big(B^i (B^i)^T - D^i (D^i)^T\big) = -\mathrm{tr}\big(\delta_i U^i (U^i)^T\big) = -s\delta_i & (19)\\
\|D^i\|_F^2 - \|C^i\|_F^2 &= 0 \quad \text{(since } C^i = D^i (V^i)^T \text{ and } V^i \text{ is orthogonal)} & (20)\\
\|C^i\|_F^2 - \|B^{i-1}\|_F^2 &= \|A_i\|^2 & (21)
\end{aligned}$$
Putting these together we get
$$\|B\|_F^2 = \|B^n\|_F^2 = \|A\|_F^2 - s\sum_{i=1}^n \delta_i.$$
Since $\|B\|_F^2 \ge 0$ we conclude that $\sum_{i=1}^n \delta_i \le \|A\|_F^2/s$. Combining with the above:
$$\|BB^T - AA^T\| \le \|A\|_F^2/s.$$
Finally, we recall that to achieve the approximation guarantee $\|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|$ it is sufficient to require $\|BB^T - AA^T\| \le \varepsilon^2\|A\|^2/2$, which is obtained when
$$s \ge \frac{2r}{\varepsilon^2}.$$
This completes the claim.
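For concreteness, here is a direct NumPy transcription of the pseudocode above: one full SVD per inserted column, exactly as written (the faster blocked variant mentioned in [4] is not shown). Names and toy sizes are illustrative; the final line checks the bound of Lemma 0.2.

```python
import numpy as np

def lossy_svd_sketch(A, s):
    """Column-wise sketch following the pseudocode above: keep an m x s
    sketch B whose last column is always zero; insert each new column,
    take an SVD, and shrink all singular values by the smallest one."""
    m, n = A.shape
    B = np.zeros((m, s))
    for i in range(n):
        C = B.copy()
        C[:, -1] = A[:, i]                    # put A_i in the last (zero) column
        U, sig, _ = np.linalg.svd(C, full_matrices=False)   # C = U diag(sig) V^T
        delta = sig[-1] ** 2                  # delta_i = (Sigma_{s,s})^2
        shrunk = np.sqrt(np.maximum(sig ** 2 - delta, 0.0))  # Sigma * W, last entry 0
        B = U * shrunk                        # D^i W^i; last column of B is zero again
    return B

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 1000))
s = 25
B = lossy_svd_sketch(A, s)
err = np.linalg.norm(A @ A.T - B @ B.T, 2)
print(err, "<=", np.linalg.norm(A, 'fro') ** 2 / s)
```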

Discussion

Note that while the deterministic lossy SVD procedure requires less space, it also requires more operations per column insertion. Namely, it computes the SVD of the sketch in every iteration. This can be somewhat improved (see [4]) but it is still far from being as efficient as column sampling. Making this algorithm faster while maintaining its approximation guarantee is an open problem. Another interesting problem is to make this algorithm take advantage of the matrix sparsity.

References

[1] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111-119, June 2012.

[2] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4), July 2007.

[3] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near optimal column-based matrix reconstruction. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS '11, pages 305-314, Washington, DC, USA, 2011. IEEE Computer Society.

[4] Edo Liberty. Simple and deterministic matrix sketching. CoRR, abs/1206.0594, 2012.