Lecture 12: Randomized Least-squares Approximation in Practice, Cont.


Stat260/CS294: Randomized Algorithms for Matrices and Data          Lecture 12 - 10/14/2013

Lecture 12: Randomized Least-squares Approximation in Practice, Cont.

Lecturer: Michael Mahoney                                            Scribe: Michael Mahoney

Warning: these notes are still very rough. They provide more details on what we discussed in class, but there may still be some errors, incomplete/imprecise statements, etc. in them.

12 Randomized Least-squares Approximation in Practice, Cont.

We continue with the discussion from last time. There is no new reading, just the same as last class. Today, we will focus on three things.

- We will describe condition numbers and how RandNLA algorithms can lead to good preconditioning.
- We will describe two different ways that randomness can enter into the parameterization of RandNLA problems.
- We will describe the Blendenpik RandNLA LS solver.

12.1 Condition numbers and preconditioning in RandNLA algorithms

Recall that we are interested in the conditioning quality of randomized sketches constructed by RandNLA sampling and projection algorithms. For simplicity of comparison with the Blendenpik paper, I'll state the results as they are stated in that paper, i.e., with a Hadamard-based projection, and then I'll point out the generalization (e.g., to leverage score sampling, to other types of projections, etc.) to other related RandNLA sketching methods. Recall that after pre-multiplying by the randomized Hadamard transform $H$, the leverage scores of $HA$ are roughly uniform, and so one can sample uniformly. Here is a definition we have mentioned before.

Definition 1  Let $A \in \mathbb{R}^{n \times d}$ be a full rank matrix, with $n > d$, and let $U \in \mathbb{R}^{n \times d}$ be an orthogonal matrix whose columns span $\mathrm{span}(A)$. Then, if $U_{(i)}$ is the $i$-th row of $U$, the coherence of $A$ is
$$ \mu(A) = \max_{i \in [n]} \| U_{(i)} \|_2^2 . $$

That is, the coherence is (up to a scaling that isn't standardized in the literature) equal to the largest leverage score. Equivalently, up to the same scaling, it equals the largest diagonal element of $P_A = A (A^T A)^{-1} A^T$, the projection matrix onto the column span of $A$. Defined this way, i.e., not normalized to be a probability distribution, the possible values for the coherence are
$$ \frac{d}{n} \le \mu(A) \le 1 . $$
Thus, with this normalization:

- If $\mu(A) \approx \frac{d}{n}$, then the coherence is small, and all the leverage scores are roughly uniform.
- If $\mu(A) \approx 1$, then the coherence is large, and the leverage score distribution is very nonuniform (in that there is at least one very large leverage score).

The following result (which is parameterized for a uniform sampling process) describes the relationship between the coherence $\mu(A)$, the sample size $r$, and the condition number of the preconditioned system.

Lemma 1  Let $A \in \mathbb{R}^{n \times d}$ be a full rank matrix, and let $S \in \mathbb{R}^{r \times n}$ be a uniform sampling operator. Let $\tau = C \sqrt{n \mu(A) \log(r) / r}$, where $C$ is a constant in the proof. Assume that $\delta^{-1} \tau < 1$, where $\delta$ is a failure probability. Then, with probability at least $1 - \delta$, we have that $\mathrm{rank}(SA) = d$, and, if $SA = QR$ is the QR decomposition of $SA$, then
$$ \kappa\left( A R^{-1} \right) \le \sqrt{ \frac{1 + \delta^{-1} \tau}{1 - \delta^{-1} \tau} } . $$

Before proceeding with the proof, here are several things to note about this result.

- We can obtain a similar result on the condition number if we sample non-uniformly based on the leverage scores, and in that case the coherence $\mu(A)$ (which could be very large, rendering the result as stated trivial) does not enter into the expression. This is of interest more generally, but we'll state the result for uniform sampling for now. The reason is that the Blendenpik solver does a random projection which (approximately) uniformizes the leverage scores, i.e., it preprocesses the input matrix to have a small coherence.
- Also, $\kappa(A)$, i.e., the condition number of the original problem instance, does not enter into the bound on $\kappa(A R^{-1})$.
- If we are willing to be very aggressive in downsampling, then the condition number of the preconditioned system might not be small enough. In this case, all is not lost: a high condition number might lead to a large number of iterations of LSQR, but we might have a distribution of eigenvalues that is not too bad and that leads to a number of iterations that is not too bad. In particular, the convergence of LSQR depends on the full distribution of the singular values of $A R^{-1}$, and not just on the ratio of the largest to the smallest (considering just that ratio leads to sufficient but not necessary conditions), and if only a few singular values are bad then this can be dealt with. We will return to this topic below.
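To make the definition concrete, here is a minimal Python/NumPy sketch (not from the original notes; the matrix and sizes are arbitrary illustrative choices) that computes the coherence $\mu(A)$ as the largest squared row norm of an orthonormal basis for $\mathrm{span}(A)$, obtained from a thin QR factorization.

    import numpy as np

    def coherence(A):
        """Coherence mu(A): the largest squared row norm (i.e., the largest
        unnormalized leverage score) of an orthonormal basis U for span(A)."""
        # Thin QR gives U in R^{n x d} with orthonormal columns spanning span(A).
        U, _ = np.linalg.qr(A, mode='reduced')
        lev = np.sum(U**2, axis=1)      # leverage scores = diagonal of P_A = U U^T
        return lev.max()

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, d = 4096, 20
        A = rng.standard_normal((n, d))   # low-coherence Gaussian design
        A[0, :] *= 100.0                  # make one row dominant -> large coherence
        print("d/n =", d / n, "  mu(A) =", coherence(A))

For the Gaussian rows the leverage scores are nearly uniform, so $\mu(A)$ starts out close to the lower bound $d/n$; inflating a single row drives $\mu(A)$ toward its upper bound of $1$.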

As we will see, the proof of this lemma directly uses ideas, e.g., subspace-preserving embeddings, that were introduced in the context of low-precision RandNLA solvers, but it uses them toward a somewhat different aim.

Proof: [of Lemma 1]  To prove the lemma, we need the following specialization of a result we stated toward the beginning of the semester. Again, since the lemma is formulated in terms of a uniform sampling process, we state the following lemma as having the coherence factor $\mu(U)$, which is necessary when uniform sampling is used. We saw this approximate matrix multiplication result before, when the uniform sampling operator was replaced with a random projection operator or a non-uniform sampling operator, and in both cases the coherence factor did not appear.

Lemma 2  Let $U \in \mathbb{R}^{n \times d}$ be an orthogonal matrix, and let $S \in \mathbb{R}^{r \times n}$ be a uniform sampling-and-rescaling operator. Then
$$ \mathbb{E}\left[ \left\| I - U^T S^T S U \right\|_2 \right] \le C \sqrt{ \frac{n \mu(U) \log(r)}{r} } . $$

Let $A = U \Sigma V^T$ be the thin SVD of $A$. Since $SU$ is full rank (as we establish below), so too is $SA = SU \Sigma V^T$ full rank, since $\Sigma$ and $V$ are invertible. Then, we can claim that
$$ \kappa(SU) = \kappa\left( A R^{-1} \right) . $$
To prove the claim, recall that

- $SU = U_{SU} \Sigma_{SU} V_{SU}^T$, by definition (the SVD of $SU$);
- $SA = QR$, by definition;
- $U_{SU} = Q W$, for a $d \times d$ unitary matrix $W$, since they span the same space.

In this case, it follows that
$$ R = Q^T SA = Q^T SU \Sigma V^T = Q^T U_{SU} \Sigma_{SU} V_{SU}^T \Sigma V^T = W \Sigma_{SU} V_{SU}^T \Sigma V^T . $$
From this, and since $\Sigma_{SU}$ is invertible (by the approximate matrix multiplication bound, since we have sampled sufficiently many rows), it follows that
$$ A R^{-1} = U \Sigma V^T V \Sigma^{-1} V_{SU} \Sigma_{SU}^{-1} W^T = U V_{SU} \Sigma_{SU}^{-1} W^T . $$
From this, it follows that
$$ \left( A R^{-1} \right)^T \left( A R^{-1} \right) = W \Sigma_{SU}^{-1} V_{SU}^T U^T U V_{SU} \Sigma_{SU}^{-1} W^T = W \Sigma_{SU}^{-2} W^T , $$
and hence
$$ \left\| \left( A R^{-1} \right)^T \left( A R^{-1} \right) \right\|_2 = \left\| W \Sigma_{SU}^{-2} W^T \right\|_2 = \left\| \Sigma_{SU}^{-2} \right\|_2 = \left\| \Sigma_{SU}^{-1} \right\|_2^2 ,$$
where the penultimate equality follows since $W$ is orthogonal and the last equality follows since $\Sigma_{SU}$ is diagonal. Similarly,
$$ \left\| \left( \left( A R^{-1} \right)^T \left( A R^{-1} \right) \right)^{-1} \right\|_2 = \left\| \Sigma_{SU} \right\|_2^2 . $$
Thus, using that $\kappa(\alpha) = \left( \| \alpha^T \alpha \|_2 \, \| (\alpha^T \alpha)^{-1} \|_2 \right)^{1/2}$ for a matrix $\alpha$, it follows that
$$ \kappa\left( A R^{-1} \right) = \left( \left\| \left( A R^{-1} \right)^T \left( A R^{-1} \right) \right\|_2 \, \left\| \left( \left( A R^{-1} \right)^T \left( A R^{-1} \right) \right)^{-1} \right\|_2 \right)^{1/2} = \left\| \Sigma_{SU}^{-1} \right\|_2 \left\| \Sigma_{SU} \right\|_2 = \kappa(SU) . $$
This establishes the claim.

So, given that claim, back to the main proof. Recall that
$$ \mathbb{E}\left[ \left\| I - U^T S^T S U \right\|_2 \right] \le \tau . $$
By Markov's inequality, it follows that
$$ \Pr\left[ \left\| I - U^T S^T S U \right\|_2 \ge \delta^{-1} \tau \right] \le \delta . $$
Thus, with probability at least $1 - \delta$, we have that $\left\| I - U^T S^T S U \right\|_2 < \delta^{-1} \tau < 1$, and thus $SU$ is full rank.

Next, recall that every eigenvalue $\lambda$ of $U^T S^T S U$ is the Rayleigh quotient of some vector $x \ne 0$, i.e.,
$$ \lambda = \frac{x^T U^T S^T S U x}{x^T x} = \frac{x^T x + x^T \left( U^T S^T S U - I \right) x}{x^T x} = 1 + \eta , $$
where $\eta$ is the Rayleigh quotient of $U^T S^T S U - I$. Since this is a symmetric matrix, its singular values are the absolute values of its eigenvalues. Thus, $|\eta| < \delta^{-1} \tau$. Thus, all the eigenvalues of $U^T S^T S U$ are in the interval $\left( 1 - \delta^{-1} \tau, \, 1 + \delta^{-1} \tau \right)$. Thus,
$$ \kappa(SU) \le \sqrt{ \frac{1 + \delta^{-1} \tau}{1 - \delta^{-1} \tau} } . $$
Combined with the claim that $\kappa(A R^{-1}) = \kappa(SU)$, this establishes the lemma.
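As a sanity check on the identity $\kappa(AR^{-1}) = \kappa(SU)$ and on the preconditioning quality of uniform sampling, here is a small Python/NumPy sketch (my own illustration, not part of the original notes; the sizes, conditioning, and sampling rate are arbitrary choices). It forms a uniform sampling-and-rescaling sketch $SA$, takes its QR factorization, and compares $\kappa(AR^{-1})$ with $\kappa(SU)$ and with $\kappa(A)$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, r = 8192, 50, 2000          # tall matrix; sample r rows uniformly

    # A low-coherence (Gaussian) design, made ill-conditioned on purpose.
    A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, 6, d))

    # Uniform sampling-and-rescaling operator S, represented by its index set.
    idx = rng.choice(n, size=r, replace=False)
    SA = A[idx, :] * np.sqrt(n / r)

    U, _ = np.linalg.qr(A, mode='reduced')    # orthonormal basis for span(A)
    SU = U[idx, :] * np.sqrt(n / r)

    _, R = np.linalg.qr(SA, mode='reduced')   # SA = QR; R is the preconditioner

    print("kappa(A)      =", np.linalg.cond(A))
    print("kappa(SU)     =", np.linalg.cond(SU))
    print("kappa(A R^-1) =", np.linalg.cond(A @ np.linalg.inv(R)))

For this low-coherence design, $\kappa(A)$ is large (around $10^6$ here) while $\kappa(SU)$ and $\kappa(AR^{-1})$ agree to roundoff and stay modest, illustrating both the claim in the proof and the fact that $\kappa(A)$ does not enter the bound.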

12.2 Randomness in error guarantees versus in running time

So far, we have been describing algorithms in the "deterministic running time and probabilistic error guarantees" framework. That is, we parameterize/formulate the problem such that we guarantee that we take no more than $O(f(n, \epsilon, \delta))$ time, where $f(n, \epsilon, \delta)$ is some function of the size $n$ of the problem, the error parameter $\epsilon$, and a parameter $\delta$ specifying the probability with which the algorithm may completely fail. That is most common in TCS, where simpler analysis for worst-case input is of interest; but in certain areas, e.g., NLA and other areas concerned with providing implementations, it is more convenient to parameterize/formulate the problem in a "probabilistic running time and deterministic error" manner. This involves making a statement of the form
$$ \Pr\left[ \text{Number of FLOPS for } \epsilon \text{ relative error} \ge f(n, \epsilon, \delta) \right] \le \delta . $$
That is, in this case, we can show that we are guaranteed to get the correct answer, but the running time is a random quantity. A few things to note.

- These algorithms are sometimes known in TCS as Las Vegas algorithms, to distinguish them from Monte Carlo algorithms, which have a deterministic running time but a probabilistic error guarantee.
- Fortunately, for many RandNLA algorithms, it is relatively straightforward to convert a Monte Carlo type algorithm to a Las Vegas type algorithm, as sketched after this list.
- This parameterization/formulation is particularly convenient when we downsample more aggressively than worst-case theory permits, since we can still get a good preconditioner (a low-rank perturbation of a good preconditioner is still a good preconditioner) and we can often still get good iterative properties due to the way the eigenvalues cluster. In particular, we might need to iterate more, but we won't fail completely.

12.3 Putting it all together into the Blendenpik algorithm

With all this in place, here is the basic Blendenpik algorithm (Algorithm 1). We will go into more detail next time on how/why this algorithm works (in terms of condition number bounds, potentially losing rank since $r = o(d \log(d))$, etc.), but here are a few final notes for today.

- Depending on how aggressively we downsample, the condition number $\kappa(AR^{-1})$ might be higher than $1 + \epsilon$. If it is too much larger, then LSQR converges slowly, but the algorithm does not fail completely, and it eventually converges. In this sense, we get Las Vegas guarantees.
- Since the downsampling is very aggressive, the preconditioner can actually be rank-deficient or ill-conditioned. The solution that Blendenpik uses is to estimate the condition number with a condition number estimator from LAPACK, and if the estimate is larger than roughly $\epsilon_{\mathrm{mach}}^{-1} / 5$, then randomly project and sample again.
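The Monte Carlo to Las Vegas conversion mentioned above can be summarized as a retry loop around the randomized construction, with a cheap deterministic acceptance test: the output always passes the test, and only the number of tries (and hence the running time) is random. The Python sketch below is only a schematic illustration of that pattern; the callables `build_preconditioner` and `quality_ok` are hypothetical placeholders, not Blendenpik or library routines.

    def las_vegas_preconditioner(A, build_preconditioner, quality_ok, max_tries=10):
        """Monte Carlo -> Las Vegas conversion: redraw the random sketch until a
        cheap deterministic quality check passes, so correctness is guaranteed
        while the number of tries (the running time) is the random quantity."""
        for attempt in range(1, max_tries + 1):
            R = build_preconditioner(A)   # randomized construction, e.g., QR of a sketch of A
            if quality_ok(R):             # e.g., a condition-number estimate below a threshold
                return R, attempt
        raise RuntimeError("too many failed sketches; fall back to a deterministic solver")

Algorithm 1 below is exactly this pattern, with the acceptance test being a LAPACK condition-number estimate of $R$.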

Algorithm 1  The Blendenpik algorithm.
Input: $A \in \mathbb{R}^{n \times d}$, and $b \in \mathbb{R}^n$.
Output: $x_{opt} \in \mathbb{R}^d$.

 1: while not returned do
 2:     Compute $HDA$ and $HDb$, where $HD$ is one of several randomized Hadamard transforms.
 3:     Randomly sample rows of $HDA$ (and the corresponding elements of $HDb$), keeping each row with probability $\gamma d / n$, where $\gamma$ is a modest oversampling factor (so roughly $\gamma d$ rows are kept), and let $S$ be the associated sampling-and-rescaling matrix.
 4:     Compute the QR decomposition $SHDA = QR$.
 5:     $\tilde\kappa = \kappa_{\mathrm{estimate}}(R)$, with LAPACK's DTRCON routine.
 6:     if $\tilde\kappa^{-1} > 5 \epsilon_{\mathrm{mach}}$ then
 7:         $x_{opt} = \mathrm{LSQR}\left(A, b, R, 10^{-14}\right)$ and return
 8:     else if number of iterations > 3 then
 9:         call LAPACK and return
10:     end if
11: end while

In Blendenpik, they use LSQR, but one could use other iterative procedures, and one gets similar results. We will go into more detail on these topics next time.
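To make the pipeline concrete, here is a compact Python/SciPy sketch in the spirit of Algorithm 1 (my own illustrative reconstruction, not the Blendenpik reference implementation). It simplifies several steps: it uses a dense Hadamard matrix (so $n$ must be a small power of 2) instead of a fast $O(n \log n)$ transform, an exact `np.linalg.cond` instead of LAPACK's DTRCON estimator, and it raises instead of retrying. The name `blendenpik_like_solve`, the factor `gamma`, and the tolerances are arbitrary illustrative choices.

    import numpy as np
    from scipy.linalg import hadamard, solve_triangular
    from scipy.sparse.linalg import LinearOperator, lsqr

    def blendenpik_like_solve(A, b, gamma=4, seed=0):
        """Sketch-and-precondition LS solve in the spirit of Algorithm 1 (illustrative only)."""
        n, d = A.shape
        rng = np.random.default_rng(seed)

        # Step 2: randomized Hadamard transform HD (dense here, for illustration only).
        D = rng.choice([-1.0, 1.0], size=n)
        HDA = (hadamard(n) / np.sqrt(n)) @ (D[:, None] * A)

        # Step 3: uniform row sampling of the mixed matrix; roughly gamma*d rows survive.
        # (Uniform rescaling omitted: it only rescales R and does not change x.)
        keep = rng.random(n) < gamma * d / n
        SA = HDA[keep, :]

        # Step 4: preconditioner R from the QR of the sketch.
        _, R = np.linalg.qr(SA, mode='reduced')

        # Steps 5-6: crude conditioning check (Blendenpik uses LAPACK's DTRCON estimator).
        if 1.0 / np.linalg.cond(R) <= 5 * np.finfo(float).eps:
            raise RuntimeError("sketch too ill-conditioned; re-project and resample")

        # Step 7: LSQR on the preconditioned operator A R^{-1}, then undo the change of variables.
        def matvec(v):
            return A @ solve_triangular(R, v, lower=False)
        def rmatvec(u):
            return solve_triangular(R, A.T @ u, lower=False, trans='T')
        AR = LinearOperator((n, d), matvec=matvec, rmatvec=rmatvec)
        y = lsqr(AR, b, atol=1e-14, btol=1e-14)[0]
        return solve_triangular(R, y, lower=False)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        n, d = 1024, 20                 # n must be a power of 2 for scipy.linalg.hadamard
        A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, 5, d))
        b = A @ rng.standard_normal(d) + 1e-8 * rng.standard_normal(n)
        x = blendenpik_like_solve(A, b)
        x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
        print("relative difference from np.linalg.lstsq:",
              np.linalg.norm(x - x_ls) / np.linalg.norm(x_ls))

The change of variables is the standard one for sketch-and-precondition solvers: LSQR is run on the well-conditioned operator $AR^{-1}$, and the solution of the original problem is recovered as $x = R^{-1} y$.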