A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression
Hanxiang Peng and Fei Tan
Indiana University Purdue University Indianapolis
Department of Mathematical Sciences, Indianapolis, IN, USA
(hanxpeng, feitan)@iupui.edu

March 19, 2018

Abstract: It was demonstrated in Peng and Tan (2018) that computing the A-optimal sampling distributions in a big data linear regression model takes the same $O(np^2)$ running time as the full-data least squares estimator. In this article, we construct a fast algorithm that computes the sampling distributions in $o(np^2)$ time and establish relative error bounds.

AMS 2000 subject classifications: Primary 62G05; secondary 62G10, 62G20.
Keywords and phrases: A-optimality; big data analysis; fast algorithm; Johnson-Lindenstrauss transform; leverage scores; non-uniform sampling.

1. Introduction

Let $X$ be an $n \times p$ matrix with $n \gg p$. For $\alpha \in \mathbb{R}$, let

$$H_\alpha = X(X^\top X)^{-\alpha}X^\top = (h^{(\alpha)}_{i,j}) =: (h_{\alpha,i,j}). \quad (1.1)$$

Then $H_1 = H = (h_{i,j})$ is the hat matrix. The A-optimal distribution is given by

$$\pi_i^{(\mathrm{aopt})} = \pi_i^{(2)} = \frac{h_{2,i,i}}{\sum_{j=1}^n h_{2,j,j}}, \quad i = 1,\dots,n. \quad (1.2)$$

For details, see Peng and Tan (2018).

2. Johnson-Lindenstrauss Transforms

Let $\mathbb{R}^{n \times p}$ denote the space of $n \times p$ matrices with real entries. Following Drineas, et al. (2012), given $\epsilon > 0$ and a set of $n$ points $x_i$ in $\mathbb{R}^p$, an $\epsilon$-Johnson-Lindenstrauss transform ($\epsilon$-JLT) for the set is a projection from $\mathbb{R}^p$ into $\mathbb{R}^r$, identified with a matrix $\Pi \in \mathbb{R}^{p \times r}$, such that

$$(1-\epsilon)\|x_i - x_j\|^2 \le \|x_i^\top\Pi - x_j^\top\Pi\|^2 \le (1+\epsilon)\|x_i - x_j\|^2, \quad i,j = 1,\dots,n. \quad (2.1)$$
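As an illustration, such a sparse random projection can be generated and its distortion checked numerically. The following is a minimal NumPy sketch, not part of the paper; the function name `sparse_jlt`, the sizes, and the seed are our own choices, anticipating the entry distribution and the subsample-size bound (2.2) given below.

```python
import numpy as np

def sparse_jlt(p, r, rng):
    # Achlioptas-style sparse projection: entries are +/- sqrt(3/r)
    # with probability 1/6 each, and 0 with probability 2/3, i.i.d.
    s = np.sqrt(3.0 / r)
    return rng.choice([s, 0.0, -s], size=(p, r), p=[1/6, 2/3, 1/6])

rng = np.random.default_rng(0)
n, p, eps, delta = 1000, 50, 0.5, 0.01
# Projection dimension from the bound (2.2):
r = int(np.ceil((4*np.log(n) + 2*np.log(1/delta)) / (eps**2/2 - eps**3/3)))
Pi = sparse_jlt(p, r, rng)
x = rng.standard_normal((n, p))
# Norm distortion as in (2.3): the squared-norm ratios should lie in [1-eps, 1+eps].
ratios = (np.linalg.norm(x @ Pi, axis=1) / np.linalg.norm(x, axis=1))**2
print(r, round(ratios.min(), 2), round(ratios.max(), 2))
```

Note that the guarantee is probabilistic: a fraction $\delta$ of draws of $\Pi$ may violate the distortion bound for some pair of points.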
What is focal is to construct a fast $\epsilon$-JLT. A popular method in the literature is to construct a random projection that is an $\epsilon$-JLT with high probability. Specifically, choose every entry of the projection to be i.i.d. with the random variable which takes the values $\pm\sqrt{3/r}$ with probability $1/6$ each and zero otherwise. Let $\Pi_{JLT}$ denote such a random projection. The following result, from Theorem 1.1 of Achlioptas (2003), ensures that $\Pi_{JLT}$ is such an $\epsilon$-JLT.

Lemma 2.1. Let $x_1,\dots,x_n$ be a set of $n$ points in $\mathbb{R}^p$. Let $\epsilon \in (0,1)$ be an accuracy parameter and $\delta$ a probability of failure. If

$$r \ge \frac{4\ln n + 2\ln(1/\delta)}{\epsilon^2/2 - \epsilon^3/3}, \quad (2.2)$$

then it holds with probability at least $1-\delta$ that the $p \times r$ random matrix $\Pi_{JLT}$ described above satisfies (2.1).

Assume $\Pi$ is an $\epsilon$-JLT. Since it is linear, we have

$$(1-\epsilon)\|x_i\|^2 \le \|x_i^\top\Pi\|^2 \le (1+\epsilon)\|x_i\|^2, \quad i = 1,\dots,n. \quad (2.3)$$

An $\epsilon$-JLT maps a point in $\mathbb{R}^p$ into $\mathbb{R}^r$ and distorts the distance between two points by a factor within $1\pm\epsilon$. While an $\epsilon$-JLT retains these local properties, the fast JLT (FJLT) satisfies stronger conditions than the $\epsilon$-JLT. Following Drineas, et al. (2012), let $U$ be an orthogonal matrix in $\mathbb{R}^{n \times p}$ and view its columns as $p$ vectors in $\mathbb{R}^n$. A projection from $\mathbb{R}^n$ to $\mathbb{R}^r$, identified with a matrix $\Pi \in \mathbb{R}^{r \times n}$, is an $\epsilon$-FJLT for $U$ if it satisfies:

- Approximate orthogonality: $\|(\Pi U)^\top(\Pi U) - I_p\|_o \le \epsilon$.
- Fast running time: for $M \in \mathbb{R}^{n \times p}$, the matrix product $\Pi M$ can be computed in $O(np\ln r)$ time.

A fast JLT possesses nice properties given in Lemma 2 of Drineas, et al. (2012); we restate these properties in Lemma 4.1 below for our use. Computing the matrix product $\Pi M$ directly takes $O(npr)$ time. An $\epsilon$-FJLT beats this running time with high probability and can be constructed by employing a randomized Hadamard transformation (RHT). An $n \times n$ Hadamard matrix can be recursively defined as $H_n = n^{-1/2}\tilde{H}_n$, where $\tilde{H}_1 = 1$ and

$$\tilde{H}_{2n} = \begin{pmatrix} \tilde{H}_n & \tilde{H}_n \\ \tilde{H}_n & -\tilde{H}_n \end{pmatrix}.$$

Here for simplicity we assume $n$ is a power of 2.
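The recursion above is exactly the butterfly pattern of the classical fast Walsh-Hadamard transform. A minimal sketch (the function name `fwht` and the sizes are ours, not from the paper):

```python
import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform via the butterfly recursion
    # H~_{2n} = [[H~_n, H~_n], [H~_n, -H~_n]]; O(n log n) operations,
    # assuming len(x) is a power of 2.
    x = np.array(x, dtype=float)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i+h].copy(), x[i+h:i+2*h].copy()
            x[i:i+h], x[i+h:i+2*h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(1)
n = 8
d = rng.choice([-1.0, 1.0], size=n)    # diagonal of the random sign matrix D
x = rng.standard_normal(n)
y = fwht(d * x) / np.sqrt(n)           # H_n D x, with H_n = n^{-1/2} H~_n
# The randomized Hadamard transform is orthogonal, so the norm is preserved:
print(np.allclose(np.linalg.norm(y), np.linalg.norm(x)))   # True
```

The sign flips $D$ and the norm preservation anticipate the randomized Hadamard transform discussed next.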
For general construction, numerical implementation and evaluation, see e.g. Avron, et al. (2010). The Hadamard matrix encodes the discrete Fourier transform over the additive group $(\mathbb{Z}/(2\mathbb{Z}))^n$: its FFT is particularly simple and requires $O(n\log n)$ time. Let $D \in \mathbb{R}^{n \times n}$ be a random diagonal matrix whose diagonal entries are i.i.d. with the random variable $D = \pm 1$ with probability $1/2$ each. The product $HD$ is an RHT and possesses useful properties, as remarked in Drineas, et al. (2012). When applied to a vector, it spreads out the energy; computing the product $HDx$ for $x \in \mathbb{R}^n$ takes only $O(n\log_2 n)$ time; and accessing $r$ components of $HDx$ takes $O(n\log_2 r)$ time.

The subsampled randomized Hadamard transformation (SRHT) uniformly samples a set of $r$ rows of an RHT. Here $HD$ plays the role of preconditioning the input matrix, and one then takes a uniform subsample of the rows of the resulting matrix. Let $S$ be an $r \times n$ sampling matrix which uniformly samples $r$ rows of an $n \times d$ matrix, and let $\Pi_{FJLT} = SHD$. As shown in Lemma 3 of Drineas, et al. (2012), $\Pi_{FJLT}$ is an FJLT for a large value of $r$ with high probability. Below we state the result with a slightly improved lower bound for the subsample size $r$, tailored for our applications.

Lemma 2.2. Let $\Pi_{FJLT}$ be an $r \times n$ random matrix obtained from the SRHT described above. Let $U$ be an $n \times p$ orthogonal matrix with $n > p$. If

$$r \ge \frac{64p\ln(40np)}{\epsilon^2}\ln\Big(\frac{64p\ln(40np)}{\epsilon^2\delta}\Big), \quad (2.4)$$

then it holds with probability at least $1-\delta$ that $\Pi_{FJLT}$ is an $\epsilon$-FJLT for $U$.

Note that the lower bound for the subsample size $r$ in Drineas, et al. (2012) is $r \ge 196\epsilon^{-2}p\ln(40np)\ln(900\epsilon^{-2}p\ln(40np))$ for a fixed $\delta$.

3. Fast Approximation of the A-optimal Distribution

Let $e_i \in \mathbb{R}^n$ be the vector with the $i$th entry one and all others zero. As $X$ has full rank, $(X^\top X)^{-1} = X^+(X^+)^\top$, so that the $i$th diagonal entry of $H_2$ is

$$h^{(2)}_{i,i} = e_i^\top H_2 e_i = \|e_i^\top X(X^\top X)^{-1}\|^2 = \|e_i^\top XX^+(X^+)^\top\|^2. \quad (3.1)$$

Similar to the fast computation of the leverage scores in Drineas, et al. (2012), computing $h^{(2)}_{i,i}$ for the A-optimal distribution according to (3.1) has a twofold bottleneck: first, computing the pseudoinverse and, second, performing the matrix multiplications; both take $O(np^2)$ time. We follow Drineas, et al. to get around the bottlenecks by judiciously applying random projections to (3.1).

To get around the $O(np^2)$ cost of computing $X^+$ in (3.1), we compute the pseudoinverse of a smaller matrix that approximates $X$. This is done by approximating $X$ by $\Pi_1 X$, where $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ is an $\epsilon$-FJLT for the left singular vector matrix $U$ of $X$, chosen as the SRHT of Lemma 2.2. Computing the products in (3.1) this way reduces the time to $O(npr_1)$. This is still not efficient, as Lemma 2.2 requires $r_1 > p$. To get around this bottleneck, we can further reduce the dimensionality by using an $\epsilon$-JLT of Lemma 2.1 to reduce the dimension to $r_2 = O(\ln n)$. Specifically, we approximate $h^{(2)}_{i,i}$ by

$$\tilde{h}^{(2)}_{i,i} = \|e_i^\top X(\Pi_1 X)^+((\Pi_1 X)^+)^\top\Pi_2\|^2, \quad (3.2)$$

where $\Pi_2 \in \mathbb{R}^{p \times r_2}$. This is realized by the algorithm in Fig. 1. Below we give the relative error bound and the running time of the procedure.

Fig 1. Fast approximation to the A-optimal sampling distribution.
Input: $X \in \mathbb{R}^{n \times p}$ (with SVD $X = U\Lambda V^\top$), error parameter $\epsilon \in (0,1)$, and failure probability $\delta \in (0,1)$.
Output: $\tilde{h}^{(2)}_{i,i}$, $i = 1,\dots,n$.
1. Let $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ be the $\epsilon$-FJLT for $U$ with $r_1$ satisfying (2.4).
2. Compute $\Pi_1 X \in \mathbb{R}^{r_1 \times p}$ and its SVD, $\Pi_1 X = U_1\Lambda_1 V_1^\top$. Let $R_1 = V_1\Lambda_1^{-1}$.
3. View the normalized rows of $XR_1R_1^\top \in \mathbb{R}^{n \times p}$ as $n$ vectors in $\mathbb{R}^p$, and compute an $\epsilon$-JLT $\Pi_2 \in \mathbb{R}^{p \times r_2}$ for the $n$ vectors and their $n^2-n$ pairwise sums, with $r_2$ satisfying (2.2).
4. Compute the matrix product $P = XR_1R_1^\top\Pi_2$.
5. Compute $\tilde{h}^{(2)}_{i,i} = \|e_i^\top P\|^2$ for $i = 1,\dots,n$.

Theorem 3.1. Let $X$ be an $n \times p$ matrix of full rank $p$ with $n \ge p$. Let $\epsilon \in (0,1)$ be an error parameter and $\delta \in (0,1)$ be a failure probability parameter. Let $h^{(2)}_{i,i}$, $i = 1,\dots,n$, be approximated by the output $\tilde{h}^{(2)}_{i,i}$ of the randomized algorithm given in Fig. 1. Then it holds with probability at least $(1-\delta)^2$ that, for $i = 1,\dots,n$,

$$|\tilde{h}^{(2)}_{i,i} - h^{(2)}_{i,i}| \le \epsilon\Big(\frac{2\epsilon^2+\epsilon}{(1-\epsilon)^2}\kappa^2(\Lambda) + \frac{4\epsilon+2}{1-\epsilon}\kappa(\Lambda) + 2\Big)h^{(2)}_{i,i}, \quad (3.3)$$

where $\kappa(X) = \sigma_{\max}(X)/\sigma_{\min}(X)$ is the condition number of $X$.

Remark 3.1. Following the computation of the running time of the algorithm for the leverage scores in Theorem 1 of Drineas, et al. (2012), the running time of the algorithm in Fig. 1 is $O(np\ln(r_1) + npr_2 + r_1p^2 + r_2p^2 + p^3)$, as there is only one additional matrix multiplication, $R_1R_1^\top$, which takes $O(p^3)$ time.
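The five steps of Fig. 1 can be sketched end to end in NumPy. This is an illustrative sketch only: a dense Gaussian projection stands in for the SRHT $\Pi_1$ of Lemma 2.2 (it shares the approximate-orthogonality property but not the fast running time), $\Pi_2$ is an Achlioptas-style sparse JLT, and the sizes $r_1$, $r_2$ are chosen for demonstration rather than from the bounds (2.2) and (2.4).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r1, r2 = 1024, 5, 512, 128   # illustrative sizes only

X = rng.standard_normal((n, p))

# Exact values via (3.1): h^(2)_{i,i} = ||e_i' X (X'X)^{-1}||^2,
# and the exact A-optimal distribution (1.2) for reference.
h2 = np.sum((X @ np.linalg.inv(X.T @ X))**2, axis=1)
pi_aopt = h2 / h2.sum()

# Step 1: Pi_1 (dense Gaussian stand-in for the SRHT of Lemma 2.2).
Pi1 = rng.standard_normal((r1, n)) / np.sqrt(r1)
# Step 2: SVD of Pi_1 X and R_1 = V_1 Lambda_1^{-1}.
_, s1, Vt1 = np.linalg.svd(Pi1 @ X, full_matrices=False)
R1 = Vt1.T / s1
# Step 3: Pi_2, an Achlioptas-style sparse JLT from R^p to R^{r2}.
c = np.sqrt(3.0 / r2)
Pi2 = rng.choice([c, 0.0, -c], size=(p, r2), p=[1/6, 2/3, 1/6])
# Steps 4-5: P = X R_1 R_1' Pi_2 and its squared row norms.
P = X @ R1 @ R1.T @ Pi2
h2_approx = np.sum(P**2, axis=1)

rel_err = np.abs(h2_approx - h2) / h2
print(round(float(np.median(rel_err)), 3))
```

With these toy sizes the relative errors are moderate rather than tiny; the theory drives them down by choosing $r_1$ and $r_2$ according to (2.4) and (2.2).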
Thus the asymptotic running time of our algorithm is the same as that of the algorithm in Drineas, et al. (2012): $O(np\ln(p\epsilon^{-1}) + np\epsilon^{-2}\ln(n) + p^3\epsilon^{-2}\ln(n)\ln(p\epsilon^{-1}))$. Treating $\epsilon$ as a constant, the asymptotic running time of our algorithm is $O(np\ln(n) + p^3\ln(n)\ln(p))$, provided that $p \le n \le \exp(p)$. The running time is $o(np^2)$ if $p\ln(p) = o(n/\ln(n))$ and $\ln(n) = o(p)$.
4. Proofs for the Fast Algorithm

We restate Lemma 2 of Drineas, et al. (2012) below.

Lemma 4.1. Let $M$ be an $n \times d$ matrix of full rank $d$ with $n \gg d$. Let the SVD of $M$ be $M = U\Sigma V^\top$. Let $\Pi$ be an $\epsilon$-FJLT for $U$ with $0 < \epsilon \le 1$, and let the SVD of $\Pi U$ be $\Pi U = U_1\Sigma_1 V_1^\top$. Then $\Pi M$, $\Pi U$, $M$ and $U$ have the same rank $d$. Moreover, $\|I_d - \Sigma_1^{-2}\|_o \le \epsilon/(1-\epsilon)$ and $(\Pi M)^+ = V\Sigma^{-1}(\Pi U)^+$.

Lemma 4.2. Let $Y_1,\dots,Y_n$ be i.i.d. $d$-dimensional complex random vectors with $\|Y_1\| \le M$ a.s. and $\|E(Y_1Y_1^*)\|_o \le 1$. Then for any $t > 0$,

$$P\Big(\Big\|\frac{1}{n}\sum_{j=1}^n Y_jY_j^* - E(Y_1Y_1^*)\Big\|_o \ge t\Big) \le (2n)^{\sqrt{1+t}}\exp\Big(-\frac{nt^2}{2M^2(1+\sqrt{1+t})^2}\Big).$$

Note we reduced Oliveira's bound $(2n)^2\exp\big(-\frac{nt^2}{16M^2+8M^2t}\big)$ to the above one.

Proof of Lemma 4.2. We shall optimize the bound in the proof of Lemma 1 of Oliveira (2010), which is

$$f(s) = (2n)^{1/(1-2M^2s/n)}\exp\big(-st + 2M^2s^2/n\big), \quad t > M^2s/n.$$

To minimize $f(s)$, we seek a choice of $s$ of the form $s_0 = \frac{nt}{2M^2(t+b)}$, $b \ge 0$. Simple calculus shows that $b = 1+\sqrt{1+t}$, and the desired result follows from simplifying $f(s_0)$.

Proof of Lemma 2.2. Using the improved bound in Lemma 4.2 above and following Lemmas 3 and 4 of Drineas, et al. (2010), we obtain the desired improved bound.

Proof of Theorem 3.1. Introduce

$$v_i = e_i^\top XX^+(X^+)^\top, \quad \hat{v}_i = e_i^\top X(\Pi_1X)^+((\Pi_1X)^+)^\top, \quad \tilde{v}_i = \hat{v}_i\Pi_2.$$

With these we have

$$h^{(2)}_{i,i} = \|v_i\|^2, \quad \tilde{h}^{(2)}_{i,i} = \|\tilde{v}_i\|^2. \quad (4.1)$$

For two row vectors $a, b$, the inner product is $\langle a,b\rangle = ab^\top$. We show below that

$$|\langle\hat{v}_i,\hat{v}_j\rangle - \langle v_i,v_j\rangle| \le \frac{\epsilon\kappa(\Lambda)}{1-\epsilon}\Big(\frac{\epsilon\kappa(\Lambda)}{1-\epsilon}+2\Big)\|v_i\|\|v_j\|, \quad (4.2)$$

$$|\langle\tilde{v}_i,\tilde{v}_j\rangle - \langle\hat{v}_i,\hat{v}_j\rangle| \le 2\epsilon\|\hat{v}_i\|\|\hat{v}_j\|, \quad i,j = 1,\dots,n, \quad (4.3)$$

and

$$\|\hat{v}_i\| \le \Big(\frac{\epsilon\kappa(\Lambda)}{1-\epsilon}+1\Big)\|v_i\|, \quad (4.4)$$

where $\kappa(\Lambda) = \|\Lambda\|_o\|\Lambda^{-1}\|_o = \sigma_{\max}(X)/\sigma_{\min}(X) = \kappa(X)$. Note that (4.3) is straightforward from the definition of the $\epsilon$-JLT. By Lemma 2.2, both (4.2) and (4.4) hold with probability at least $1-\delta$, while (4.3) holds with probability at least $1-\delta$ by Lemma 2.1. Consequently, (4.2)-(4.4) hold simultaneously with probability at least $(1-\delta)^2$, as $\Pi_1$ and $\Pi_2$ are independent random matrices. From these the desired (3.3) immediately follows in view of

$$|\tilde{h}^{(2)}_{i,i} - h^{(2)}_{i,i}| \le |\langle\tilde{v}_i,\tilde{v}_i\rangle - \langle\hat{v}_i,\hat{v}_i\rangle| + |\langle\hat{v}_i,\hat{v}_i\rangle - \langle v_i,v_i\rangle|.$$

To prove (4.2), we use the SVD $X = U\Lambda V^\top$, where $U$ ($V$) is the $n \times p$ ($p \times p$) orthonormal matrix consisting of the left (right) singular vectors as its columns, and $\Lambda$ is the $p \times p$ singular value matrix. As $X$ has full rank $p$, all of $U$, $V$, $\Lambda$ have the same rank $p$. By Lemmas 4.1 and 2.2, it holds with probability at least $1-\delta$ that $(\Pi_1X)^+ = V\Lambda^{-1}(\Pi_1U)^+$. Using the singular value decompositions, we can express $v_i$ and $\hat{v}_i$ as

$$v_i = e_i^\top U\Lambda^{-1}V^\top, \quad \hat{v}_i = e_i^\top U(\Pi_1U)^+((\Pi_1U)^+)^\top\Lambda^{-1}V^\top. \quad (4.5)$$

Thus

$$|\langle\hat{v}_i,\hat{v}_j\rangle - \langle v_i,v_j\rangle| \le \|A\|_o\|v_i\|\|v_j\|,$$

where

$$A = \Lambda(\Pi_1U)^+((\Pi_1U)^+)^\top\Lambda^{-2}(\Pi_1U)^+((\Pi_1U)^+)^\top\Lambda - I_p.$$

Let $B = (\Pi_1U)^+((\Pi_1U)^+)^\top$. Then

$$\|A\|_o \le \|B - I_p\|_o^2\,\kappa^2(\Lambda) + 2\|B - I_p\|_o\,\kappa(\Lambda). \quad (4.6)$$

As $\Pi_1U$ has full rank $p$ with probability at least $1-\delta$, we have $B = ((\Pi_1U)^\top(\Pi_1U))^{-1}$ and

$$\|B - I_p\|_o \le \|(\Pi_1U)^\top(\Pi_1U) - I_p\|_o\|B\|_o \le \epsilon(\|B - I_p\|_o + 1),$$

where we used the defining property of an $\epsilon$-FJLT. Thus $\|B - I_p\|_o \le \epsilon/(1-\epsilon)$. Substitution of this in (4.6) gives (4.2). Using this and the second representation in (4.5), the desired (4.4) follows from

$$\|\hat{v}_i\| \le \|\Lambda B\Lambda^{-1}\|_o\|v_i\| \le (\kappa(\Lambda)\|B - I_p\|_o + 1)\|v_i\|.$$
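A concentration bound of this type is easy to check by simulation. The sketch below (our own illustration, not from the paper) draws Rademacher vectors $Y_j \in \{-1,+1\}^d$, which satisfy $\|Y_j\| = \sqrt{d}$ and $E(Y_jY_j^\top) = I_d$, and compares the empirical tail probability with Oliveira's (2010) bound $(2n)^2\exp(-nt^2/(16M^2+8M^2t))$, which Lemma 4.2 tightens.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, t, trials = 3, 5000, 0.5, 100
M = np.sqrt(d)            # ||Y_j|| = sqrt(d) almost surely

exceed = 0
for _ in range(trials):
    # Rademacher vectors: E(Y Y') = I_d, so ||E(Y Y')||_o = 1.
    Y = rng.choice([-1.0, 1.0], size=(n, d))
    dev = np.linalg.norm(Y.T @ Y / n - np.eye(d), 2)
    if dev >= t:
        exceed += 1

# Oliveira's (2010) tail bound at this (n, t, M):
bound = (2*n)**2 * np.exp(-n * t**2 / (16*M**2 + 8*M**2*t))
print(exceed / trials, round(float(bound), 3))
```

For these parameters the bound is already well below 1, and the empirical exceedance frequency is far smaller still, as expected from a worst-case tail bound.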
References

[1] Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4).
[2] Avron, H., Maymounkov, P. and Toledo, S. (2010). Blendenpik: Supercharging LAPACK's least-squares solver. SIAM Journal on Scientific Computing, 32.
[3] Baxter, J., Jones, R., Lin, M. and Olsen, J. (2004). SLLN for weighted independent identically distributed random variables. J. Theoret. Probab., 17.
[4] Candès, E.J. and Tao, T. (2009). Exact matrix completion via convex optimization. Found. Comput. Math., 9:717.
[5] Drineas, P., Kannan, R. and Mahoney, M.W. (2006). Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36.
[6] Drineas, P., Mahoney, M.W. and Muthukrishnan, S. (2008).
[7] Drineas, P., Mahoney, M.W., Muthukrishnan, S. and Sarlós, T. (2010). Faster least squares approximation. Numerische Mathematik, 117(2).
[8] Drineas, P., Mahoney, M.W. and Muthukrishnan, S. (2006). Sampling algorithms for l2 regression and applications. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms.
[9] Drineas, P., Magdon-Ismail, M., Mahoney, M.W. and Woodruff, D.P. (2012). Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13.
[10] Oliveira, R.I. (2010). Sums of random Hermitian matrices and an inequality by Rudelson. Technical report. arXiv preprint.
[11] Mahoney, M.W. and Drineas, P. (2009).
[12] Mahoney, M.W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. NOW Publishers, Boston.
[13] Drineas, P., Mahoney, M.W., Muthukrishnan, S. and Sarlós, T. (2010). Faster least squares approximation. Numerische Mathematik, 117(2).
[14] Papadimitriou, C.H., Raghavan, P., Tamaki, H. and Vempala, S. (2000). Latent semantic indexing: a probabilistic analysis. J. Computer and System Sciences, 61(2).
[15] Sarlós, T. (2006). Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science.
[16] Sarlós, T. (2010).
[17] Drineas, P., Mahoney, M.W. and Muthukrishnan, S. (2008). Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30.
[18] Mahoney, M.W. and Drineas, P. (2009). CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA, 106.
[19] Boutsidis, C., Mahoney, M.W. and Drineas, P. (2009). An improved approximation algorithm for the column subset selection problem. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms.
[20] Teicher, H. (1974). On the law of the iterated logarithm. Ann. Probability, 2.
[21] Wang, C., Chen, M.-H., Schifano, E., Wu, J. and Yan, J. (2015). A survey of statistical methods and computing for big data. arXiv preprint.
Improved Bounds on the Dot Product under Random Projection and Random Sign Projection Ata Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/
More informationApproximating a Gram Matrix for Improved Kernel-Based Learning
Approximating a Gram Matrix for Improved Kernel-Based Learning (Extended Abstract) Petros Drineas 1 and Michael W. Mahoney 1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New
More informationYale university technical report #1402.
The Mailman algorithm: a note on matrix vector multiplication Yale university technical report #1402. Edo Liberty Computer Science Yale University New Haven, CT Steven W. Zucker Computer Science and Appled
More informationRandom projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016
Lecture notes 5 February 9, 016 1 Introduction Random projections Random projections are a useful tool in the analysis and processing of high-dimensional data. We will analyze two applications that use
More informationsublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)
sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank
More informationNumerische Mathematik
Numer. Math. 011) 117:19 49 DOI 10.1007/s0011-010-0331-6 Numerische Mathematik Faster least squares approximation Petros Drineas Michael W. Mahoney S. Muthukrishnan Tamás Sarlós Received: 6 May 009 / Revised:
More informationarxiv: v1 [stat.me] 23 Jun 2013
A Statistical Perspective on Algorithmic Leveraging Ping Ma Michael W. Mahoney Bin Yu arxiv:1306.5362v1 [stat.me] 23 Jun 2013 Abstract One popular method for dealing with large-scale data sets is sampling.
More informationA fast randomized algorithm for approximating an SVD of a matrix
A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,
More informationELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018
ELE 538B: Mathematics of High-Dimensional Data Spectral methods Yuxin Chen Princeton University, Fall 2018 Outline A motivating application: graph clustering Distance and angles between two subspaces Eigen-space
More informationAccelerated Dense Random Projections
1 Advisor: Steven Zucker 1 Yale University, Department of Computer Science. Dimensionality reduction (1 ε) xi x j 2 Ψ(xi ) Ψ(x j ) 2 (1 + ε) xi x j 2 ( n 2) distances are ε preserved Target dimension k
More informationNotes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T.
Notes on singular value decomposition for Math 54 Recall that if A is a symmetric n n matrix, then A has real eigenvalues λ 1,, λ n (possibly repeated), and R n has an orthonormal basis v 1,, v n, where
More informationSubspace sampling and relative-error matrix approximation
Subspace sampling and relative-error matrix approximation Petros Drineas Rensselaer Polytechnic Institute Computer Science Department (joint work with M. W. Mahoney) For papers, etc. drineas The CUR decomposition
More informationRandom Projections for Support Vector Machines
Saurabh Paul Christos Boutsidis Malik Magdon-Ismail Petros Drineas Computer Science Dept. Mathematical Sciences Dept. Computer Science Dept. Computer Science Dept. Rensselaer Polytechnic Inst. IBM Research
More informationSingle Pass PCA of Matrix Products
Single Pass PCA of Matrix Products Shanshan Wu The University of Texas at Austin shanshan@utexas.edu Sujay Sanghavi The University of Texas at Austin sanghavi@mail.utexas.edu Srinadh Bhojanapalli Toyota
More informationRandNLA: Randomization in Numerical Linear Algebra: Theory and Practice
RandNLA: Randomization in Numerical Linear Algebra: Theory and Practice Petros Drineas Ilse Ipsen (organizer) Michael W. Mahoney RPI NCSU UC Berkeley To access our web pages use your favorite search engine.
More informationFast and Robust Least Squares Estimation in Corrupted Linear Models
Fast and Robust Least Squares Estimation in Corrupted Linear Models Brian McWilliams Gabriel Krummenacher Mario Lucic Joachim M. Buhmann Department of Computer Science ETH Zürich, Switzerland {mcbrian,gabriel.krummenacher,lucic,jbuhmann}@inf.ethz.ch
More informationLecture 24: Element-wise Sampling of Graphs and Linear Equation Solving. 22 Element-wise Sampling of Graphs and Linear Equation Solving
Stat260/CS294: Randomized Algorithms for Matrices and Data Lecture 24-12/02/2013 Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving Lecturer: Michael Mahoney Scribe: Michael Mahoney
More informationPositive definite preserving linear transformations on symmetric matrix spaces
Positive definite preserving linear transformations on symmetric matrix spaces arxiv:1008.1347v1 [math.ra] 7 Aug 2010 Huynh Dinh Tuan-Tran Thi Nha Trang-Doan The Hieu Hue Geometry Group College of Education,
More informationTechnical Report. Random projections for Bayesian regression. Leo Geppert, Katja Ickstadt, Alexander Munteanu and Christian Sohler 04/2014
Random projections for Bayesian regression Technical Report Leo Geppert, Katja Ickstadt, Alexander Munteanu and Christian Sohler 04/2014 technische universität dortmund Part of the work on this technical
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationarxiv: v1 [cs.lg] 22 Mar 2014
CUR lgorithm with Incomplete Matrix Observation Rong Jin an Shenghuo Zhu Dept. of Computer Science an Engineering, Michigan State University, rongjin@msu.eu NEC Laboratories merica, Inc., zsh@nec-labs.com
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More informationA Tutorial on Matrix Approximation by Row Sampling
A Tutorial on Matrix Approximation by Row Sampling Rasmus Kyng June 11, 018 Contents 1 Fast Linear Algebra Talk 1.1 Matrix Concentration................................... 1. Algorithms for ɛ-approximation
More informationBlendenpik: Supercharging LAPACK's Least-Squares Solver
Blendenpik: Supercharging LAPACK's Least-Squares Solver The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher
More informationOpen Research Online The Open University s repository of research publications and other research outputs
Open Research Online The Open University s repository of research publications and other research outputs A note on the Weiss conjecture Journal Item How to cite: Gill, Nick (2013). A note on the Weiss
More informationFast Random Projections
Fast Random Projections Edo Liberty 1 September 18, 2007 1 Yale University, New Haven CT, supported by AFOSR and NGA (www.edoliberty.com) Advised by Steven Zucker. About This talk will survey a few random
More informationarxiv: v1 [cs.ds] 24 Dec 2017
Lectures on Randomized Numerical Linear Algebra * Petros Drineas Michael W. Mahoney arxiv:1712.08880v1 [cs.ds] 24 Dec 2017 Contents 1 Introduction 2 2 Linear Algebra 3 2.1 Basics..............................................
More informationTechnical Report No September 2009
FINDING STRUCTURE WITH RANDOMNESS: STOCHASTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS N. HALKO, P. G. MARTINSSON, AND J. A. TROPP Technical Report No. 2009-05 September 2009 APPLIED
More informationA Practical Randomized CP Tensor Decomposition
A Practical Randomized CP Tensor Decomposition Casey Battaglino, Grey Ballard 2, and Tamara G. Kolda 3 SIAM CSE 27 Atlanta, GA Georgia Tech Computational Sci. and Engr. Wake Forest University Sandia National
More informationR A N D O M I Z E D L I N E A R A L G E B R A F O R L A R G E - S C A L E D ATA A P P L I C AT I O N S
R A N D O M I Z E D L I N E A R A L G E B R A F O R L A R G E - S C A L E D ATA A P P L I C AT I O N S a dissertation submitted to the institute for computational and mathematical engineering and the committee
More informationSparse Features for PCA-Like Linear Regression
Sparse Features for PCA-Like Linear Regression Christos Boutsidis Mathematical Sciences Department IBM T J Watson Research Center Yorktown Heights, New York cboutsi@usibmcom Petros Drineas Computer Science
More informationNon-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007
Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007 Lecturer: Roman Vershynin Scribe: Matthew Herman 1 Introduction Consider the set X = {n points in R N } where
More information