Lecture 7: Sampling/Projections for Least-squares Approximation, Cont.

Stat260/CS294: Randomized Algorithms for Matrices and Data    Lecture 7 - 09/25/2013

Lecture 7: Sampling/Projections for Least-squares Approximation, Cont.

Lecturer: Michael Mahoney    Scribe: Michael Mahoney

Warning: these notes are still very rough. They provide more details on what we discussed in class, but there may still be some errors, incomplete/imprecise statements, etc. in them.

7 Sampling/Projections for Least-squares Approximation, Cont.

We continue with the discussion from last time. There is no new reading, just the same as last class.

Recall that last time we provided a brief overview of LS problems and a brief overview of sketching methods for LS problems. For the latter, we provided a lemma that showed that under certain conditions the solution of a sketched LS problem was a good approximation to the solution of the original LS problem, where "good" is with respect to the objective function value. Today, we will focus on three things.

- Establishing goodness results for the sketched LS problem, where goodness is with respect to the certificate or solution vector.
- Relating these two structural conditions, and the satisfaction of the two conditions by random sampling/projection, to exact and approximate matrix multiplication algorithms.
- Putting everything together into two basic (but still slow; we'll speed them up soon enough) RandNLA algorithms for the LS problem.

7.1 Deterministic and randomized sketches and LS problems, cont.

Last time we identified two structural conditions, and we proved that if those structural conditions are satisfied by a sketching matrix, then the subproblem defined by that sketching matrix has a solution that is a relative-error approximation to the original problem, i.e., that the objective function value of the original problem is approximated well. Now, we will prove that the vector itself solving the subproblem is a good approximation of the vector solving the original problem. After that, we will show that random sampling and random projection matrices satisfy those two structural conditions, for appropriate values of the parameter settings.

Lemma 1 Same setup as the previous lemma. Then

$$ \|x_{opt} - \tilde{x}_{opt}\|_2 \le \frac{1}{\sigma_{\min}(A)} \sqrt{\epsilon}\, Z . \qquad (1) $$

Proof: If we use the same notation as in the proof of the previous lemma, then $A(x_{opt} - \tilde{x}_{opt}) = U_A z_{opt}$. If we take the norm of both sides of this expression, we have that

$$ \|x_{opt} - \tilde{x}_{opt}\|_2^2 \le \frac{\|U_A z_{opt}\|_2^2}{\sigma_{\min}^2(A)} \qquad (2) $$
$$ \le \frac{\epsilon Z^2}{\sigma_{\min}^2(A)} , \qquad (3) $$

where (2) follows since $\sigma_{\min}(A)$ is the smallest singular value of $A$ and since the rank of $A$ is $d$; and (3) follows by a result in the proof of the previous lemma and the orthogonality of $U_A$. Taking the square root, the claim of the lemma follows.

If we make no assumption on $b$, then (1) from Lemma 1 may provide a weak bound in terms of $\|x_{opt}\|_2$. If, on the other hand, we make the additional assumption that a constant fraction of the norm of $b$ lies in the subspace spanned by the columns of $A$, then (1) can be strengthened. Such an assumption is reasonable, since most least-squares problems are practically interesting only if at least some part of $b$ lies in the subspace spanned by the columns of $A$.

Lemma 2 Same setup as the previous lemma, and assume that $\|U_A U_A^T b\|_2 \ge \gamma \|b\|_2$, for some fixed $\gamma \in (0, 1]$. Then, it follows that

$$ \|x_{opt} - \tilde{x}_{opt}\|_2 \le \sqrt{\epsilon}\, \kappa(A) \sqrt{\gamma^{-2} - 1}\, \|x_{opt}\|_2 . \qquad (4) $$

Proof: Since $\|U_A U_A^T b\|_2 \ge \gamma \|b\|_2$, it follows that

$$ Z^2 = \|b\|_2^2 - \|U_A U_A^T b\|_2^2 \le (\gamma^{-2} - 1) \|U_A U_A^T b\|_2^2 \le \sigma_{\max}^2(A) (\gamma^{-2} - 1) \|x_{opt}\|_2^2 . $$

This last inequality follows from $U_A U_A^T b = A x_{opt}$, which implies $\|U_A U_A^T b\|_2 = \|A x_{opt}\|_2 \le \|A\|_2 \|x_{opt}\|_2 = \sigma_{\max}(A) \|x_{opt}\|_2$. By combining this with eqn. (1) of Lemma 1, the lemma follows.
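
To get a feel for the scales appearing in Lemmas 1 and 2, here is a minimal NumPy sketch. Everything in it (the problem sizes, the noise level, and the use of a dense Gaussian sketching matrix $S$) is an illustrative assumption rather than anything specified above; it simply compares the observed solution-vector error to the quantities $Z/\sigma_{\min}(A)$ and $\kappa(A)\sqrt{\gamma^{-2}-1}\,\|x_{opt}\|_2$.

```python
import numpy as np

# A minimal numerical sketch of the quantities in Lemmas 1 and 2.
# Problem sizes, noise level, and the Gaussian sketch S are illustrative choices.
rng = np.random.default_rng(0)
n, d, r = 2000, 10, 400

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
Z = np.linalg.norm(A @ x_opt - b)                 # optimal residual norm

S = rng.standard_normal((r, n)) / np.sqrt(r)      # one possible sketching matrix
x_tilde, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

svals = np.linalg.svd(A, compute_uv=False)
sigma_min, kappa = svals[-1], svals[0] / svals[-1]
gamma = np.linalg.norm(A @ x_opt) / np.linalg.norm(b)   # fraction of b in span(A)

print("||x_opt - x_tilde||       :", np.linalg.norm(x_opt - x_tilde))
print("Lemma 1 scale Z/sigma_min :", Z / sigma_min)
print("Lemma 2 scale kappa*sqrt(gamma^{-2}-1)*||x_opt|| :",
      kappa * np.sqrt(gamma**-2 - 1) * np.linalg.norm(x_opt))
```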

7.2 Connections with exact and approximate matrix multiplication

7.2.1 An aside: approximating matrix multiplication for vector inputs

Before continuing with our discussion of LS regression, here is a simple example of applying the matrix multiplication ideas that might help shed some light on the form of the bounds, as well as on when the bounds are tight and when they are not.

Let's say that we have two vectors $x, y \in \mathbb{R}^n$ and we want to approximate their product by random sampling. In this case, we are approximating $x^T y$ as $x^T S S^T y$, where $S$ is a random sampling matrix that, let's assume, is constructed with nearly optimal sampling probabilities. Then, our main bound says that, under appropriate assumptions on the number of random samples we draw, we get a bound of the form

$$ \|x^T y - x^T S S^T y\|_F \le \epsilon \|x\|_F \|y\|_F , $$

which, since we are dealing with the product of two vectors, simplifies to

$$ |x^T y - x^T S S^T y| \le \epsilon \|x\|_2 \|y\|_2 . $$

The question is: when is this bound tight, and when is this bound loose, as a function of the input data? Clearly, if $x \perp y$, i.e., if $x^T y = 0$, then this bound will be weak. On the other hand, if $y = x$, then $x^T y = x^T x = \|x\|_2^2$, in which case this bound says that

$$ |x^T x - x^T S S^T x| \le \epsilon \|x\|_2^2 , $$

meaning that the algorithm provides a relative-error guarantee on $x^T x = \|x\|_2^2$. (We can make similar statements more generally if we are multiplying two rectangular orthogonal matrices to form a low-dimensional identity, and this is important for providing subspace-preserving sketches.) The lesson here is that when there is cancellation the bound is weak, and that the scales set by the norms of the component matrices are in some sense real. For general matrices, the situation is more complex, since subspaces can interact in more complicated ways, but similar ideas go through.
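
To make the aside concrete, here is a small sketch of the sampling estimator $x^T S S^T y$ with probabilities $p_i \propto |x_i||y_i|$ (a stand-in for the "nearly optimal" probabilities in the vector case; the vectors and sample size are illustrative). Running it shows the two regimes discussed above: essentially relative error when $y = x$, and an error that is small only relative to the scale $\|x\|_2\|y\|_2$ when $y \perp x$.

```python
import numpy as np

# Sampling-based estimate of an inner product x^T y, as in the aside above.
# The estimator is x^T S S^T y, where S has c columns of the form e_i / sqrt(c p_i).
rng = np.random.default_rng(1)
n, c = 10_000, 500

def sampled_inner_product(x, y, c, rng):
    n = x.shape[0]
    p = np.abs(x) * np.abs(y)
    p = p / p.sum() if p.sum() > 0 else np.full(n, 1.0 / n)  # fall back to uniform
    idx = rng.choice(n, size=c, p=p)
    # x^T S S^T y = (1/c) * sum_t x_{i_t} y_{i_t} / p_{i_t}
    return np.mean(x[idx] * y[idx] / p[idx])

x = rng.standard_normal(n)
y_same = x                               # no cancellation: relative-error regime
y_orth = rng.standard_normal(n)
y_orth -= (x @ y_orth) / (x @ x) * x     # orthogonal to x: heavy cancellation

for name, y in [("y = x", y_same), ("y perp x", y_orth)]:
    est = sampled_inner_product(x, y, c, rng)
    print(f"{name}: exact = {x @ y:.3f}, sampled = {est:.3f}, "
          f"scale ||x|| ||y|| = {np.linalg.norm(x) * np.linalg.norm(y):.1f}")
```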

7.2.2 Understanding and exploiting these structural conditions via approximate matrix multiplication

These lemmas say that if our sketching matrix $X$ satisfies Condition I and Condition II, then we have relative-error approximation on both the solution vector/certificate and on the value of the objective at the optimum. There are a number of things we can do with this, and here we will focus on establishing a priori running time guarantees for any, i.e., worst-case, input. But, before we get into the algorithmic details, we will outline how these structural conditions relate to our previous approximate matrix multiplication results, and how we will use the latter to prove our results.

The main point to note is that both Condition I and Condition II can be expressed as approximate matrix multiplications and thus bounded by our approximate matrix multiplication results from a few classes ago. To see this, observe that a slightly stronger condition than Condition I is that

$$ \left| 1 - \sigma_i^2(X U_A) \right| \le 1/2 , \quad \forall i , $$

and, since $U_A$ is an $n \times d$ orthogonal matrix, with $n \gg d$, we have that $U_A^T U_A = I_d$, and so this latter condition says that $U_A^T U_A \approx U_A^T X^T X U_A$ in the spectral norm, i.e., that

$$ \| I - (X U_A)^T X U_A \|_2 \le 1/2 . $$

Similarly, since $U_A^T b^{\perp} = 0$, Condition II says that $U_A^T X^T X b^{\perp} \approx 0$ with respect to the Frobenius norm, i.e., that

$$ \| U_A^T X^T X b^{\perp} \|_F^2 \le \epsilon Z^2 / 2 . $$

Of course, this is simply the Euclidean norm, since $b^{\perp}$ is simply a vector. For general matrices, for the Frobenius norm, the scale of the right-hand-side error, i.e., the quantity that is multiplied by the $\epsilon$, depends on the norms of the matrices entering into the product. But the norm of $b^{\perp}$ is simply the residual value $Z$, which sets the scale of the error and of the solution. And, for general matrices, for the spectral norm, there were quantities that depended on the spectral and Frobenius norms of the input matrices, but for orthogonal matrices like $U_A$, those are $1$ or the low dimension $d$, and so they can be absorbed into the sampling complexity.
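
The following sketch checks the two conditions, in the forms just stated, for a leverage-score sampling matrix $X$ built the way the sampling algorithm below builds it; the matrix sizes and the sample size $r$ are illustrative choices, not prescribed values.

```python
import numpy as np

# Empirically checking the two structural conditions for a leverage-score
# sampling matrix X (all sizes here are illustrative).
rng = np.random.default_rng(2)
n, d, r = 5000, 8, 1000

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

U, _ = np.linalg.qr(A)                   # orthogonal basis for span(A)
b_perp = b - U @ (U.T @ b)               # part of b outside span(A); Z = ||b_perp||
lev = np.sum(U**2, axis=1)               # leverage scores ||U_(i)||^2
p = lev / d                              # leverage-score probabilities, sum to 1

idx = rng.choice(n, size=r, p=p)
scale = 1.0 / np.sqrt(r * p[idx])        # rescale each sampled row by 1/sqrt(r p_i)
XU = U[idx] * scale[:, None]             # this is X U_A
Xb_perp = b_perp[idx] * scale            # this is X b_perp

cond1 = np.linalg.norm(np.eye(d) - XU.T @ XU, 2)   # Condition I: want <= 1/2
cond2 = np.linalg.norm(XU.T @ Xb_perp)**2          # Condition II: want <= (eps/2) Z^2
print("Condition I  ||I - (XU)^T XU||_2        :", cond1)
print("Condition II ||(XU)^T X b_perp||^2 / Z^2:", cond2 / np.linalg.norm(b_perp)**2)
```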

7.2.3 Bounds on approximate matrix multiplication when information about both matrices is unavailable

The situation regarding bounding the error incurred in the two structural conditions is actually somewhat more subtle than the previous discussion would imply. The reason is that, although we might have access to information such as the leverage scores (row norms of one matrix), which depend on $U_A$, we in general don't have access to any information about $b$ (and thus the row norms involving it). Nevertheless, an extension of our previous discussion still holds, and we will describe it now.

Observe that the nearly optimal probabilities

$$ p_k \ge \beta \frac{\|A^{(k)}\|_2 \|B_{(k)}\|_2}{\sum_{k'=1}^{n} \|A^{(k')}\|_2 \|B_{(k')}\|_2} $$

for approximating the product of two general matrices $A$ and $B$ use information from both matrices $A$ and $B$ in a very particular form. In some cases, such detailed information about both matrices may not be available. In particular, in some cases, we will be interested in approximating the product of two different matrices, $A$ and $B$, when only information about $A$ (or, equivalently, only $B$) is available. Somewhat surprisingly, in this case, we can still obtain partial bounds (i.e., of similar form, but with slightly weaker concentration) of the form we saw above.

Here, we present results for the BasicMatrixMultiplication algorithm for two other sets of probabilities. In the first case, to estimate the product $AB$ one could use the probabilities (5), which use information from the matrix $A$ only. In this case $\|AB - CR\|_F$ can still be shown to be small in expectation; the proof of this lemma is similar to that of our theorem for the nearly-optimal probabilities from a few classes ago, except that the indicated probabilities are used.

Lemma 3 Suppose $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $c \in \mathbb{Z}^{+}$ such that $1 \le c \le n$, and $\{p_i\}_{i=1}^{n}$ are such that $\sum_{i=1}^{n} p_i = 1$ and such that

$$ p_k \ge \beta \frac{\|A^{(k)}\|_2^2}{\|A\|_F^2} \qquad (5) $$

for some positive constant $\beta \le 1$. Construct $C$ and $R$ with the BasicMatrixMultiplication algorithm, and let $CR$ be an approximation to $AB$. Then:

$$ \mathbb{E}\left[ \|AB - CR\|_F \right] \le \frac{1}{\sqrt{\beta c}} \|A\|_F \|B\|_F . \qquad (6) $$

Following the analysis of our theorem for the nearly-optimal probabilities from a few classes ago, we can let $M = \max_{\alpha} \|B_{(\alpha)}\|_2 / \|A^{(\alpha)}\|_2$, let $\delta \in (0,1)$, and let $\eta = 1 + \frac{\|A\|_F}{\|B\|_F} M \sqrt{(8/\beta) \log(1/\delta)}$, in which case it can be shown that, with probability at least $1 - \delta$:

$$ \|AB - CR\|_F \le \frac{\eta}{\sqrt{\beta c}} \|A\|_F \|B\|_F . $$

Unfortunately, the assumption on $M$, which depends on the maximum ratio of two vector norms, is sufficiently awkward that this result is not useful. Nevertheless, we can still remove the expectation from Eqn. (6) with Markov's inequality, paying the factor of $1/\delta$, but without any awkward assumptions, assuming that we are willing to live with a result that holds with constant probability. This will be fine for several applications we will encounter, and when we use Lemma 3, this is how we will use it.

We should emphasize that for most probabilities, e.g., even simple probabilities that are proportional to (say) the Euclidean norm of the columns of $A$ (as opposed to the norm-squared of the columns of $A$, or the product of the norms of the columns of $A$ and the corresponding rows of $B$), we obtain much uglier and unusable expressions, e.g., we get awkward factors such as $M$ above. Lest the reader think that any sampling probabilities will yield interesting results, even for the expectation, here are the analogous results if sampling is performed uniformly at random. Note that the scaling factor of $n$ is much worse than anything we have seen so far; it means that we would have to choose $c$ to be larger than $n$ to obtain nontrivial results, clearly defeating the point of random sampling in the first place.

Lemma 4 Suppose $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $c \in \mathbb{Z}^{+}$ such that $1 \le c \le n$, and $\{p_i\}_{i=1}^{n}$ are such that

$$ p_k = \frac{1}{n} . \qquad (7) $$

Construct $C$ and $R$ with the BasicMatrixMultiplication algorithm, and let $CR$ be an approximation to $AB$. Then:

$$ \mathbb{E}\left[ \|AB - CR\|_F \right] \le \sqrt{\frac{n}{c}} \left( \sum_{k=1}^{n} \|A^{(k)}\|_2^2 \|B_{(k)}\|_2^2 \right)^{1/2} . \qquad (8) $$

Furthermore, let $\delta \in (0,1)$ and $\gamma = n \sqrt{\frac{8}{c} \log(1/\delta)}\, \max_{\alpha} \|A^{(\alpha)}\|_2 \|B_{(\alpha)}\|_2$; then with probability at least $1 - \delta$:

$$ \|AB - CR\|_F \le \sqrt{\frac{n}{c}} \left( \sum_{k=1}^{n} \|A^{(k)}\|_2^2 \|B_{(k)}\|_2^2 \right)^{1/2} + \gamma . \qquad (9) $$
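
Here is a sketch of the BasicMatrixMultiplication algorithm run with the two families of probabilities just discussed: the Lemma 3 probabilities $p_k \propto \|A^{(k)}\|_2^2$ and the uniform probabilities of Lemma 4. The matrices (built with deliberately nonuniform column norms) and the sample size $c$ are illustrative.

```python
import numpy as np

# BasicMatrixMultiplication with column-norm-squared probabilities (Lemma 3)
# versus uniform probabilities (Lemma 4). Sizes and c are illustrative.
rng = np.random.default_rng(3)
m, n, p_dim, c = 100, 5000, 80, 200

A = rng.standard_normal((m, n)) * rng.exponential(1.0, size=n)  # nonuniform columns
B = rng.standard_normal((n, p_dim))

def basic_matmul(A, B, c, probs, rng):
    """Sample c column/row pairs with the given probabilities and rescale."""
    n = A.shape[1]
    idx = rng.choice(n, size=c, p=probs)
    scale = 1.0 / np.sqrt(c * probs[idx])
    C = A[:, idx] * scale                 # sampled, rescaled columns of A
    R = B[idx, :] * scale[:, None]        # corresponding rescaled rows of B
    return C @ R

p_norm2 = np.sum(A**2, axis=0) / np.sum(A**2)   # Lemma 3 probabilities
p_unif = np.full(n, 1.0 / n)                    # Lemma 4 probabilities

exact = A @ B
for name, probs in [("||A^(k)||^2 probs", p_norm2), ("uniform probs", p_unif)]:
    err = np.linalg.norm(exact - basic_matmul(A, B, c, probs, rng), "fro")
    print(f"{name}: ||AB - CR||_F / (||A||_F ||B||_F) = "
          f"{err / (np.linalg.norm(A) * np.linalg.norm(B)):.4f}")
```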

7.3 Random sampling and random projection for LS approximation

So, to take advantage of the above two structural results and bound them with our matrix multiplication bounds, we need to perform the random sampling with respect to the so-called statistical leverage scores, which are defined as $\|U_{(i)}\|_2^2$, where $U_{(i)}$ is the $i$-th row of any orthogonal matrix for $\mathrm{span}(A)$. If we normalize them, then we get the leverage score probabilities:

$$ p_i = \frac{1}{d} \|U_{(i)}\|_2^2 . \qquad (10) $$

These will be important for our subsequent discussion, and so there are several things we should note about them.

- Since $U$ is an $n \times d$ orthogonal matrix, the normalization is just the lower dimension $d$, i.e., $d = \|U\|_F^2$.

- Although we have defined these scores in terms of a particular basis $U$, they don't depend on that particular basis; instead they depend on $A$, or actually on $\mathrm{span}(A)$. To see this, let $P_A = A A^{+}$ be the projection onto the span of $A$, and note that $P_A = Q R R^{-1} Q^T = Q Q^T$, where $R$ is any square non-singular orthogonal transformation between orthogonal matrices for $\mathrm{span}(A)$. So, in particular, up to the scaling factor of $1/d$, the leverage scores equal the diagonal elements of the projection matrix $P_A$:

  $$ (P_A)_{ii} = \left( U_A U_A^T \right)_{ii} = \|U_{A(i)}\|_2^2 = \left( Q_A Q_A^T \right)_{ii} = \|Q_{A(i)}\|_2^2 . $$

  Thus, they are equal to the diagonal elements of the so-called hat matrix.

- These are scores that quantify where in the high-dimensional space $\mathbb{R}^n$ the (singular value) information in $A$ is being sent (independent of what that information is). They capture a notion of leverage or influence that the $i$-th constraint has on the LS fit.

- They can be very uniform or very nonuniform. E.g., if $U_A = \begin{bmatrix} I \\ 0 \end{bmatrix}$, then they are clearly very nonuniform, but if $U_A$ consists of a small number of columns from a truncated Hadamard matrix or a dense Gaussian matrix, then they are uniform or nearly uniform (the sketch below illustrates both cases).
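
A short sketch of how one might compute these probabilities in practice (via a thin QR factorization), contrasting the two extreme cases mentioned in the last bullet; the sizes are illustrative.

```python
import numpy as np

# Leverage-score probabilities p_i = (1/d) ||U_(i)||^2 from a QR factorization,
# comparing a very nonuniform example (identity stacked on zeros) with a nearly
# uniform one (a dense Gaussian matrix). Sizes are illustrative.
rng = np.random.default_rng(4)
n, d = 1000, 5

def leverage_probs(A):
    U, _ = np.linalg.qr(A)          # any orthogonal basis for span(A) works
    return np.sum(U**2, axis=1) / A.shape[1]

A_spiky = np.vstack([np.eye(d), np.zeros((n - d, d))])   # U_A = [I; 0]
A_flat = rng.standard_normal((n, d))                     # dense Gaussian matrix

for name, A in [("identity-on-zeros", A_spiky), ("dense Gaussian", A_flat)]:
    probs = leverage_probs(A)
    print(f"{name}: max p_i = {probs.max():.4f}, "
          f"uniform value 1/n = {1.0 / n:.4f}, sum = {probs.sum():.4f}")
```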

With that in place, here we will present two algorithms that compute relative-error approximations to the LS problem. First, we start with a random sampling algorithm, given as Algorithm 1.

Algorithm 1 A slow random sampling algorithm for the LS problem.
Input: An $n \times d$ matrix $A$, with $n \gg d$, and an $n$-vector $b$
Output: A $d$-vector $\tilde{x}_{opt}$
1: Compute $p_i = \frac{1}{d} \|U_{(i)}\|_2^2$, for all $i \in [n]$, from the QR decomposition or the SVD.
2: Randomly sample $r = O(d \log d / \epsilon)$ rows of $A$ and elements of $b$, rescaling each by $\frac{1}{\sqrt{r p_{i_t}}}$, i.e., form $SA$ and $Sb$.
3: Solve $\min_{x \in \mathbb{R}^d} \|SAx - Sb\|_2$ with a black box to get $\tilde{x}_{opt}$.
4: Return $\tilde{x}_{opt}$.

For this algorithm, one can prove the following theorem. The idea of the proof is to combine the structural lemma with matrix multiplication bounds showing that, under appropriate assumptions on the size of the sample, etc., the two structural conditions are satisfied.

Theorem 1 Algorithm 1 returns a $(1 \pm \epsilon)$-approximation to the LS objective and an $\epsilon$-approximation to the solution vector.
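
Here is a direct NumPy rendering of Algorithm 1; it is deliberately slow, since it computes the exact leverage scores via a full QR factorization, and the particular choices of $n$, $d$, and $r$ are illustrative.

```python
import numpy as np

# A direct (and deliberately slow) implementation of Algorithm 1: exact leverage
# scores from a QR factorization, importance sampling of r rows, and a black-box
# LS solve on the sampled problem. The choice of r is illustrative.
def slow_sampling_ls(A, b, r, rng):
    n, d = A.shape
    U, _ = np.linalg.qr(A)                       # step 1: exact leverage scores
    p = np.sum(U**2, axis=1) / d
    idx = rng.choice(n, size=r, p=p)             # step 2: sample and rescale
    w = 1.0 / np.sqrt(r * p[idx])
    SA, Sb = A[idx] * w[:, None], b[idx] * w
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)   # step 3: black-box solve
    return x_tilde

rng = np.random.default_rng(5)
n, d = 20_000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_tilde = slow_sampling_ls(A, b, r=2000, rng=rng)
print("objective ratio:", np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_opt - b))
```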

Next, we consider a random projection algorithm, given as Algorithm 2.

Algorithm 2 A slow random projection algorithm for the LS problem.
Input: An $n \times d$ matrix $A$, with $n \gg d$, and an $n$-vector $b$
Output: A $d$-vector $\tilde{x}_{opt}$
1: Let $S$ be a random projection matrix consisting of scaled i.i.d. Gaussian, $\{\pm 1\}$, etc., random variables.
2: Randomly project onto $r = O(d \log d / \epsilon)$ rows, i.e., linear combinations of rows, of $A$ and elements of $b$.
3: Solve $\min_{x \in \mathbb{R}^d} \|SAx - Sb\|_2$ with a black box to get $\tilde{x}_{opt}$.
4: Return $\tilde{x}_{opt}$.

For this algorithm, one can prove the following theorem. As before, the idea of the proof is to combine the structural lemma with the random projection version of matrix multiplication bounds (which are in the first homework) to show that, under appropriate assumptions on the size of the sample, etc., the two structural conditions are satisfied.

Theorem 2 Algorithm 2 returns a $(1 \pm \epsilon)$-approximation to the LS objective and an $\epsilon$-approximation to the solution vector.

We are not going to go into the details of the proofs of these two theorems, basically since they will parallel the proofs of the fast versions of these two results that we will discuss in the next few classes. But it is worth pointing out that you do get good quality-of-approximation bounds for the LS problem with these algorithms. The problem is the running time. Both of these algorithms take at least as long to run (at least in terms of worst-case FLOPS in the RAM model) as the time to solve the problem exactly with traditional deterministic algorithms, i.e., $\Theta(nd^2)$ time. For Algorithm 1, the bottleneck in running time is the time to compute the leverage score importance sampling distribution exactly. For Algorithm 2, the bottleneck in running time is the time to implement the random projection, i.e., to do the matrix-matrix multiplication associated with the random projection, and since we are projecting onto roughly $d \log d$ dimensions the running time is actually $\Omega(nd^2)$.

Thus, these algorithms are slow in the sense that they are slower than traditional algorithms, at least in terms of FLOPS in an idealized RAM model; but note that they may be, and in some cases are, faster on real machines, basically for communication reasons, and similarly they might be faster in parallel-distributed environments. In particular, the random projection is just matrix-matrix multiplication, and this can be faster than doing things like QR or the SVD, even if the FLOP count is the same.

But we will focus on FLOPS, and so we want algorithms that run in $o(nd^2)$ time. We will use structured or Hadamard-based random projections, which can be implemented with Fast Fourier methods, so that the overall running time will be $o(nd^2)$. There will be two ways to do this: first, call a black box (the running time bottleneck of which is a Hadamard-based random projection) to approximate the leverage scores, and use them as the importance sampling distribution; and second, do a Hadamard-based random projection to uniformize the leverage scores and then sample uniformly. In the next few classes, we will get into these issues. Why random projections satisfy matrix multiplication bounds might be a bit of a mystery, partly since we have focused less on it, so we will get into the details of two related forms of the random projection. Also, the black box to approximate the leverage scores might be surprising, since it isn't obvious that they can be computed quickly, so we will get into that. All of the results we will describe will also hold for general random sampling with exact leverage scores and for general random projections, but we will get into the details for the fast versions, so that we can make running time claims for analogues of the fast sampling and projection versions of the above two algorithms.
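
For completeness, here is the analogous rendering of Algorithm 2 with a dense $\{\pm 1\}$ projection (a scaled Gaussian $S$ works the same way); again the sizes are illustrative, and the dense matrix-matrix multiply $SA$ is the projection bottleneck discussed above, which the Hadamard-based projections of the next few classes are designed to remove.

```python
import numpy as np

# A direct (and deliberately slow) implementation of Algorithm 2 with a dense
# {+1, -1} random projection. The projection dimension r is illustrative; the
# dense multiply S @ A is the expensive step that fast, Hadamard-based
# projections avoid.
def slow_projection_ls(A, b, r, rng):
    n, _ = A.shape
    S = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(r)   # step 1: projection matrix
    SA, Sb = S @ A, S @ b                                    # step 2: project A and b
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)        # step 3: black-box solve
    return x_tilde

rng = np.random.default_rng(6)
n, d = 10_000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_tilde = slow_projection_ls(A, b, r=1000, rng=rng)
print("objective ratio:", np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_opt - b))
```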