Least-Squares Regression on Sparse Spaces

Yuri Grinberg, Mahdi Milani Fard, Joelle Pineau
School of Computer Science, McGill University, Montreal, Canada
{ygrinb,mmilan1,jpineau}@cs.mcgill.ca

1 Introduction

Compressed sampling has been studied in the context of regression theory from two perspectives. One is when, given a training set, we aim to compress that set into a smaller one by combining training instances using random projections (see e.g. [1]). Such a method is useful, for instance, when the training set is too large or one has to handle privacy issues. Another application is when one uses random projections to project each input vector into a lower-dimensional space, and then trains a predictor in the new compressed space (compression on the feature space). As is typical of dimensionality reduction techniques, this reduces the variance of most predictors at the expense of introducing some bias. Random projections on the feature space, along with least-squares predictors, are studied in [2], and the method is shown to reduce the estimation error at the price of a controlled approximation error. The analysis in [2] provides on-sample error bounds and extends them to bounds on the sampling measure, assuming an i.i.d. sampling strategy.

This paper provides a bias-variance analysis of regression in compressed spaces when random projections are applied to sparse input signals. We show that the sparsity assumption lets us work with arbitrary non-i.i.d. sampling strategies, and we derive a worst-case bound on the entire space. Such a bound can be used to select the optimal size of the projection, so as to minimize the sum of the expected estimation and prediction errors. It also provides the means to compare the error of linear predictors in the original and compressed spaces.

2 Notation and Sparsity Assumption

Throughout this paper, column vectors are represented by lower-case bold letters, and matrices are represented by bold capital letters. $|\cdot|$ denotes the size of a set, and $\|\cdot\|_0$ is Donoho's zero "norm", indicating the number of non-zero elements in a vector. $\|\cdot\|$ denotes the $L_2$ norm for vectors and the operator norm for matrices: $\|M\| = \sup_v \|Mv\|/\|v\|$. We denote the Moore-Penrose pseudo-inverse of a matrix $M$ by $M^\dagger$ and the smallest singular value of $M$ by $\sigma^{(M)}_{\min}$.

We will be working in sparse input spaces for our prediction task. Our input is represented by a vector $x \in \mathcal{X}$ of $D$ features, with $\|x\| \le 1$. We assume that $x$ is $k$-sparse in some known or unknown basis $\Psi$, implying that $\mathcal{X} \subseteq \{\Psi z, \text{ s.t. } \|z\|_0 \le k \text{ and } \|z\| \le 1\}$.

3 Random Projections and Inner Products

It is well known that random projections of appropriate sizes preserve enough information for exact reconstruction with high probability (see e.g. [3, 4]). In this section, we show that a function that is almost linear in the original space remains almost linear in the projected space, when we use random projections of appropriate sizes.
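As a concrete illustration of the objects defined in Section 2 (this sketch is not part of the original paper; the orthonormal basis, dimensions, and sparsity level are arbitrary choices), the following NumPy snippet builds a $k$-sparse input $x = \Psi z$ with $\|z\|_0 \le k$ and $\|x\| \le 1$, and evaluates the norms and the pseudo-inverse used throughout the analysis.

```python
# Minimal illustrative sketch of the Section 2 notation (not from the paper):
# a k-sparse input x = Psi @ z, Donoho's zero "norm" ||z||_0, the operator
# norm, the Moore-Penrose pseudo-inverse, and the smallest singular value.
import numpy as np

rng = np.random.default_rng(0)
D, k = 100, 5

# Orthonormal basis Psi (here simply the Q factor of a random Gaussian matrix).
Psi, _ = np.linalg.qr(rng.standard_normal((D, D)))

# k-sparse coefficient vector z with ||z||_0 <= k and ||z|| <= 1.
z = np.zeros(D)
support = rng.choice(D, size=k, replace=False)
z[support] = rng.standard_normal(k)
z /= np.linalg.norm(z)                 # scale so that ||z|| = 1

x = Psi @ z                            # an input vector in the sparse space X
print("||z||_0 =", np.count_nonzero(z))   # Donoho's zero norm
print("||x||   =", np.linalg.norm(x))     # <= 1 since Psi is orthonormal

# Matrix notation: operator norm, pseudo-inverse, smallest singular value.
M = rng.standard_normal((20, D))
print("||M||        =", np.linalg.norm(M, 2))              # operator norm
M_pinv = np.linalg.pinv(M)                                  # Moore-Penrose pseudo-inverse
print("sigma_min(M) =", np.linalg.svd(M, compute_uv=False).min())
```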

There are several types of random projection matrices that can be used. In this work, we assume that each entry in a projection $\Phi_{D \times d}$ is an i.i.d. sample from a Gaussian¹:

$$\phi_{i,j} \sim \mathcal{N}(0, 1/d). \qquad (1)$$

We build our work on the following result, based on Theorem 4.1 from [3], which shows that for a finite set of points, the inner product with a fixed vector is almost preserved after a random projection.

Theorem 1. Let $\Phi_{D \times d}$ be a random projection according to Eqn 1. Let $S$ be a finite set of points in $\mathbb{R}^D$. Then for any fixed $w \in \mathbb{R}^D$ and $\epsilon > 0$:

$$\forall s \in S : \quad \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \le \epsilon \|w\| \|s\| \qquad (2)$$

fails with probability less than $(4|S| + 2)\,e^{-d\epsilon^2/48}$.

The above theorem is based on the well-known Johnson-Lindenstrauss lemma (see [3]), which considers random projections of finite sets of points. We derive the corresponding theorem for sparse feature spaces.

Theorem 2. Let $\Phi_{D \times d}$ be a random projection according to Eqn 1. Let $\mathcal{X}$ be a $D$-dimensional $k$-sparse space. Then for any fixed $w$ and $\epsilon > 0$:

$$\forall x \in \mathcal{X} : \quad \left| \langle \Phi^T w, \Phi^T x \rangle - \langle w, x \rangle \right| \le \epsilon \|w\| \|x\| \qquad (3)$$

fails with probability less than:

$$(eD/k)^k \left( 4(12/\epsilon)^k + 2 \right) e^{-d\epsilon^2/192} \;\le\; e^{\,k \log(12eD/(\epsilon k)) - d\epsilon^2/192 + \log 5}.$$

Note that the above theorem does not require $w$ to be in the sparse space, and thus differs from guarantees on the preservation of the inner product between vectors within the sparse space.

Proof of Theorem 2. The proof follows the steps of the proof of Theorem 5.2 from [5]. Because $\Phi$ is a linear transformation, we only need to prove the theorem when $\|w\| = \|x\| = 1$. Denote by $\Psi$ the basis with respect to which $\mathcal{X}$ is sparse. Let $T \subseteq \{1, 2, \ldots, D\}$ be any set of $k$ indexes. For each set of indexes $T$, we define a $k$-dimensional hyperplane in the $D$-dimensional input space: $\mathcal{X}_T \triangleq \{\Psi z, \text{ s.t. } z \text{ is zero outside } T \text{ and } \|z\| \le 1\}$. By definition we have $\mathcal{X} = \bigcup_T \mathcal{X}_T$. We first show that Eqn 3 holds for each $\mathcal{X}_T$, and then use the union bound to prove the theorem.

For any given $T$, we choose a set $S \subset \mathcal{X}_T$ such that:

$$\forall x \in \mathcal{X}_T : \quad \min_{s \in S} \|x - s\| \le \epsilon/4. \qquad (4)$$

It is easy to prove (see e.g. Chapter 13 of [6]) that these conditions can be satisfied by choosing a grid of size $|S| \le (12/\epsilon)^k$, since $\mathcal{X}_T$ is a $k$-dimensional hyperplane in $\mathbb{R}^D$ ($S$ fills up the space within $\epsilon/4$ distance). Now applying Theorem 1, and with $\|w\| = 1$, we have that:

$$\forall s \in S : \quad \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \le \frac{\epsilon}{2} \|s\| \qquad (5)$$

fails with probability less than $\left( 4(12/\epsilon)^k + 2 \right) e^{-d\epsilon^2/192}$. Let $a$ be the smallest number such that:

$$\forall x \in \mathcal{X}_T : \quad \left| \langle \Phi^T w, \Phi^T x \rangle - \langle w, x \rangle \right| \le a \|x\| \qquad (6)$$

holds when Eqn 5 holds. The goal is to show that $a \le \epsilon$. For any given $x \in \mathcal{X}_T$, we choose an $s \in S$ for which $\|x - s\| \le \epsilon/4$. Therefore we have:

$$\begin{aligned}
\left| \langle \Phi^T w, \Phi^T x \rangle - \langle w, x \rangle \right|
&\le \left| \langle \Phi^T w, \Phi^T x \rangle - \langle \Phi^T w, \Phi^T s \rangle - \langle w, x \rangle + \langle w, s \rangle \right| \qquad (7) \\
&\quad + \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \qquad (8) \\
&= \left| \langle \Phi^T w, \Phi^T (x - s) \rangle - \langle w, x - s \rangle \right| \qquad (9) \\
&\quad + \left| \langle \Phi^T w, \Phi^T s \rangle - \langle w, s \rangle \right| \qquad (10) \\
&\le a\epsilon/4 + \epsilon/2. \qquad (11)
\end{aligned}$$

¹ The elements of the projection are typically taken to be distributed as $\mathcal{N}(0, 1/D)$, but we scale them by $\sqrt{D/d}$, so that we avoid scaling the projected values (see e.g. [3]).
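As an illustration (not an experiment from the paper), the following NumPy sketch draws a projection $\Phi$ with i.i.d. $\mathcal{N}(0, 1/d)$ entries as in Eqn 1 and empirically measures the inner-product distortion that Theorem 2 controls, for a fixed $w$ and $k$-sparse inputs $x$. The dimensions and the helper `sparse_x` are arbitrary choices, and the canonical basis plays the role of $\Psi$.

```python
# Minimal sketch: empirical check of inner-product preservation (Theorem 2)
# under a random projection Phi with i.i.d. N(0, 1/d) entries (Eqn 1).
import numpy as np

rng = np.random.default_rng(1)
D, d, k = 1000, 200, 5

Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(D, d))    # Eqn 1
w = rng.standard_normal(D)
w /= np.linalg.norm(w)                                   # fixed vector, ||w|| = 1

def sparse_x():
    """Draw a k-sparse x in the canonical basis with ||x|| = 1."""
    x = np.zeros(D)
    idx = rng.choice(D, size=k, replace=False)
    x[idx] = rng.standard_normal(k)
    return x / np.linalg.norm(x)

# Distortion |<Phi^T w, Phi^T x> - <w, x>| over many sparse inputs.
errs = []
for _ in range(2000):
    x = sparse_x()
    errs.append(abs(np.dot(Phi.T @ w, Phi.T @ x) - np.dot(w, x)))

# The maximum distortion stays small relative to ||w|| ||x|| = 1 when d is
# large enough, as the theorem suggests.
print("max distortion over sampled sparse inputs:", max(errs))
```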

The last line is by the definition of $a$, and by applying Eqn 5 (which holds with high probability). Because of the definition of $a$, there is an $x \in \mathcal{X}_T$ (and, by scaling, one with norm 1) for which Eqn 6 is tight. Therefore we have $a \le a\epsilon/4 + \epsilon/2$, which proves $a \le \epsilon$ for any choice of $\epsilon < 1$.

Note that there are $\binom{D}{k}$ possible sets $T$. Since $\binom{D}{k} \le (eD/k)^k$ and $\mathcal{X} = \bigcup_T \mathcal{X}_T$, the union bound gives us that the theorem fails with probability less than $(eD/k)^k \left( 4(12/\epsilon)^k + 2 \right) e^{-d\epsilon^2/192}$.

4 Bias-Variance Analysis of Ordinary Least-Squares

In this section, we analyze the worst-case prediction error made by the ordinary least-squares (OLS) solution. For completeness, we first provide bounds on OLS in the original space, which is partly a classical result in linear prediction theory. Then we proceed to the main result of this paper, which is the bias-variance analysis of OLS in the projected space.

We seek to predict a signal $f$ that is assumed to be a near-linear function of $x \in \mathcal{X}$:

$$f(x) = x^T w + b_f(x), \quad \text{where } |b_f(x)| \le \epsilon_f, \qquad (12)$$

for some $\epsilon_f > 0$, and where we assume $\|w\| \le 1$. We are given a training set of $n$ input-output pairs, consisting of a full-rank input matrix $X_{n \times D}$, along with noisy observations of $f$:

$$y = Xw + b_f + \eta, \qquad (13)$$

where, for the additive bias term, we overload the notation with $b_{f,i} = b_f(x_i)$; and we assume the homoscedastic noise term $\eta$ to be a vector of i.i.d. random variables distributed as $\mathcal{N}(0, \sigma^2_\eta)$. Given the above, we seek a predictor that, for any query $x \in \mathcal{X}$, predicts the target signal $f(x)$. The following lemma provides a bound on the worst-case error of the ordinary least-squares predictor.

Lemma 3. Let $w_{\mathrm{ols}}$ be the OLS solution of Eqn 13 with additive bias bounded by $\epsilon_f$ and i.i.d. noise with variance $\sigma^2_\eta$. Then for any $0 < \delta_{\mathrm{var}} \le \sqrt{2/e\pi}$, for all $x \in \mathcal{X}$, with probability no less than $1 - \delta_{\mathrm{var}}$, the error of the OLS prediction satisfies:

$$\left| f(x) - x^T w_{\mathrm{ols}} \right| \le \|x\| \|X^\dagger\| \left( \epsilon_f \sqrt{n} + \sigma_\eta \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})} \right) + \epsilon_f. \qquad (14)$$

Proof of Lemma 3. For the OLS solution of Eqn 13 we have:

$$w_{\mathrm{ols}} = X^\dagger y = X^\dagger (Xw + b_f + \eta) = w + X^\dagger b_f + X^\dagger \eta. \qquad (15)$$

Therefore for all $x \in \mathcal{X}$ the error satisfies:

$$\begin{aligned}
\left| f(x) - x^T w_{\mathrm{ols}} \right| &\le \left| x^T w_{\mathrm{ols}} - x^T w \right| + \epsilon_f \qquad (16) \\
&\le \left| x^T X^\dagger b_f \right| + \left| x^T X^\dagger \eta \right| + \epsilon_f. \qquad (17)
\end{aligned}$$

For the first term on the right hand side (part of the prediction bias), we have:

$$\left| x^T X^\dagger b_f \right| \le \|x^T X^\dagger\| \|b_f\| \le \|x\| \|X^\dagger\| \epsilon_f \sqrt{n}. \qquad (18)$$

For the second term in line 17 (prediction variance), the expectation of $x^T X^\dagger \eta$ is 0, as $\eta$ is independent of the data and has zero mean. We also know that it is a weighted sum of normally distributed random variables, and thus is normal with variance:

$$\begin{aligned}
\mathrm{Var}[x^T X^\dagger \eta] &= \mathbb{E}[x^T X^\dagger \eta \eta^T X^{\dagger T} x] \qquad (19) \\
&= \sigma^2_\eta \, x^T X^\dagger X^{\dagger T} x \qquad (20) \\
&\le \sigma^2_\eta \|x^T X^\dagger\| \|X^{\dagger T} x\| \qquad (21) \\
&\le \sigma^2_\eta \|x\|^2 \|X^\dagger\|^2, \qquad (22)
\end{aligned}$$

where in line 20 we used the i.i.d. assumption on the noise. Thereby we can bound $x^T X^\dagger \eta$ by the tail probability of the normal distribution as needed. Using a standard upper bound on the tail probability of normals, when $0 < \delta_{\mathrm{var}} \le \sqrt{2/e\pi}$, with probability no less than $1 - \delta_{\mathrm{var}}$:

$$\left| x^T X^\dagger \eta \right| \le \sigma_\eta \|x\| \|X^\dagger\| \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})}. \qquad (23)$$

Adding up the bias and the variance terms gives us the bound in the lemma.
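To make Lemma 3 concrete, here is a small numerical sketch (illustrative only, with arbitrary problem sizes and noise levels): it fits OLS through the pseudo-inverse as in Eqn 15 and compares the observed prediction error at a test point with the right-hand side of Eqn 14.

```python
# Minimal sketch: OLS via the pseudo-inverse and a numerical evaluation of
# the worst-case bound of Lemma 3,
#   |f(x) - x^T w_ols| <= ||x|| ||X^+|| (eps_f sqrt(n)
#                          + sigma_eta sqrt(log(2/(pi delta^2)))) + eps_f.
import numpy as np

rng = np.random.default_rng(2)
n, D = 500, 50
eps_f, sigma_eta, delta_var = 0.01, 0.1, 0.05   # delta_var <= sqrt(2/(e*pi))

w_true = rng.standard_normal(D)
w_true /= np.linalg.norm(w_true)                    # ||w|| <= 1

X = rng.normal(0.0, np.sqrt(1.0 / D), size=(n, D))  # full-rank input matrix
b_f = eps_f * (2 * rng.random(n) - 1)               # bounded additive bias
eta = rng.normal(0.0, sigma_eta, size=n)            # homoscedastic noise
y = X @ w_true + b_f + eta                          # Eqn 13

X_pinv = np.linalg.pinv(X)
w_ols = X_pinv @ y                                  # Eqn 15

# Observed error at a test point vs. the bound of Eqn 14.
x = rng.standard_normal(D)
x /= np.linalg.norm(x)
f_x = x @ w_true + eps_f * (2 * rng.random() - 1)
err = abs(f_x - x @ w_ols)
bound = (np.linalg.norm(x) * np.linalg.norm(X_pinv, 2)
         * (eps_f * np.sqrt(n)
            + sigma_eta * np.sqrt(np.log(2 / (np.pi * delta_var**2))))
         + eps_f)
print(f"observed error {err:.4f}  vs  Lemma 3 bound {bound:.4f}")
```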

5 Compressed Ordinary Least-Squares

We are now ready to derive an upper bound for the worst-case error of the OLS predictor in a compressed space. In this setting, we first project the inputs into a lower-dimensional space using random projections, and then use the OLS estimator on the compressed input signals.

Theorem 4. Let $\Phi_{D \times d}$ be a random projection according to Eqn 1, and let $w^\Phi_{\mathrm{ols}}$ be the OLS solution in the compressed space induced by the projection. Assume an additive bias in the original space bounded by some $\epsilon_f > 0$ and i.i.d. noise with variance $\sigma^2_\eta$. Choose any $0 < \delta_{\mathrm{prj}}, \delta_\Phi < 1$ and $0 < \delta_{\mathrm{var}} \le \sqrt{2/e\pi}$. Then, with probability no less than $1 - (\delta_{\mathrm{prj}} + \delta_\Phi)$, we have for all $x \in \mathcal{X}$, with probability no less than $1 - \delta_{\mathrm{var}}$:

$$\left| f(x) - x^T \Phi w^\Phi_{\mathrm{ols}} \right| \le \frac{\left(1 + \sqrt{(2/d)\log(1/\delta_\Phi)}\right)\left((\epsilon_f + \epsilon_{\mathrm{prj}})\sqrt{n} + \sigma_\eta \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})}\right)}{\sigma^{(X)}_{\min}\left(\sqrt{D/d} - 1 - \sqrt{(2/d)\log(1/\delta_\Phi)}\right)} + \epsilon_f + \epsilon_{\mathrm{prj}}, \qquad (24, 25)$$

where $\epsilon_{\mathrm{prj}} = c\,\sqrt{\dfrac{k \log(12eD/k) + \log(1/\delta_{\mathrm{prj}})}{d}}$ for a constant $c$.

Proof of Theorem 4. Using Theorem 2, the following holds with probability no less than $1 - \delta_{\mathrm{prj}}$:

$$f(x) = (\Phi^T x)^T (\Phi^T w) + b_f(x) + b_{\mathrm{prj}}(x), \qquad (26)$$

where $|b_f(x)| \le \epsilon_f$ and $|b_{\mathrm{prj}}(x)| \le \epsilon_{\mathrm{prj}}$. Note that $\|(X\Phi)^\dagger\| \le \|X^\dagger\| \|\Phi^\dagger\| \le 1/(\sigma^{(X)}_{\min}\sigma^{(\Phi)}_{\min})$. Using the bound discussed in [7], we have with probability $1 - \delta_\Phi$:

$$\sigma^{(\Phi)}_{\min} \ge \sqrt{D/d} - 1 - \sqrt{(2/d)\log(1/\delta_\Phi)}.$$

Now, using Lemma 3 with a function of the form described in Eqn 26, we have:

$$\left| f(x) - x^T \Phi w^\Phi_{\mathrm{ols}} \right| \le \|x^T \Phi\| \|(X\Phi)^\dagger\| \left( (\epsilon_f + \epsilon_{\mathrm{prj}})\sqrt{n} + \sigma_\eta \sqrt{\log(2/\pi\delta^2_{\mathrm{var}})} \right) + \epsilon_f + \epsilon_{\mathrm{prj}}, \qquad (27)$$

which yields the theorem after the substitution of $\epsilon_{\mathrm{prj}}$ and the matrix norms.

Assuming $\epsilon_f = 0$ for simplicity, we can rewrite the bound as:

$$\left| f(x) - x^T \Phi w^\Phi_{\mathrm{ols}} \right| \le \tilde{O}\!\left(\frac{1}{\sigma^{(X)}_{\min}}\sqrt{\frac{n\,k\log(D/k)}{D}} + \sqrt{\frac{k\log(D/k)}{d}}\right) + \tilde{O}\!\left(\frac{\sigma_\eta}{\sigma^{(X)}_{\min}}\sqrt{\frac{d}{D}}\right).$$

The first $\tilde{O}$ term on the right hand side is part of the bias due to the projection. The second $\tilde{O}$ term is the variance term. This bound is particularly useful when $n > D$. With that assumption, and in order to illustrate a clearer bias-variance trade-off, assume that $X_{i,j}$ is drawn from $\mathcal{N}(0, 1/D)$. Then we have $\sigma^{(X)}_{\min} \simeq \sqrt{n/D}$. Fixing the values of the $\delta$'s and ignoring the log term (a slowly growing function of $d$), we get that the error is bounded by:

$$c_0 + c_1 \sqrt{\frac{k\log(D/k)}{d}} + c_2\, \sigma_\eta\, \frac{\sqrt{d} + 1}{\sqrt{n}},$$

in which case we clearly observe the trade-off with respect to the compressed dimension $d$. Now if $d < n < D$ with $X_{i,j} \sim \mathcal{N}(0, 1/D)$, we have $\sigma^{(X)}_{\min} \simeq 1 - \sqrt{n/D}$, which gives us the following bound:

$$c_0 + c_1 \sqrt{\frac{k\log(D/k)}{d}} + c_2\, \frac{\sigma_\eta}{1 - \sqrt{n/D}} \sqrt{\frac{d}{D}}.$$

This bound is, however, counter-intuitive, as the error grows when $n$ is increased. This is due to the fact that the bound $\|(X\Phi)^\dagger\| \le \|X^\dagger\| \|\Phi^\dagger\|$ gets looser as $n$ gets close to $D$. When $n$ is close to $D$, we might not have a tight closed-form bound on the $\|(X\Phi)^\dagger\|$ term, but we can still calculate it empirically for any given value of $d$ by sampling a specific projection of the corresponding size.
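The following sketch (an illustration, not the paper's experiment; the dimensions, the noise level, and the helper `sparse_rows` are arbitrary choices, and the canonical basis plays the role of $\Psi$) runs compressed OLS as described in this section: it projects sparse inputs with a $\Phi$ drawn according to Eqn 1, fits OLS in the $d$-dimensional space, and traces the worst-case test error as $d$ varies, which makes the bias-variance trade-off visible. It also computes $\|(X\Phi)^\dagger\|$ empirically, as suggested in the last paragraph.

```python
# Minimal sketch of compressed OLS (Section 5): project the sparse inputs
# with Phi ~ N(0, 1/d) entrywise, solve OLS in the compressed space, and
# report the empirical worst-case error and ||(X Phi)^+|| for several d.
import numpy as np

rng = np.random.default_rng(3)
n, D, k, sigma_eta = 400, 1000, 10, 0.1

w_true = rng.standard_normal(D)
w_true /= np.linalg.norm(w_true)

def sparse_rows(m):
    """m rows, each k-sparse in the canonical basis with unit norm."""
    Z = np.zeros((m, D))
    for i in range(m):
        idx = rng.choice(D, size=k, replace=False)
        Z[i, idx] = rng.standard_normal(k)
        Z[i] /= np.linalg.norm(Z[i])
    return Z

X = sparse_rows(n)
y = X @ w_true + rng.normal(0.0, sigma_eta, size=n)     # eps_f = 0 here

X_test = sparse_rows(2000)
f_test = X_test @ w_true

for d in (10, 50, 100, 200, 400):
    Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(D, d))
    XPhi = X @ Phi                                      # compressed inputs
    XPhi_pinv = np.linalg.pinv(XPhi)
    w_ols_phi = XPhi_pinv @ y                           # OLS in compressed space
    err = np.abs(f_test - X_test @ Phi @ w_ols_phi).max()
    print(f"d={d:4d}  worst-case test error={err:.3f}  "
          f"||(X Phi)^+||={np.linalg.norm(XPhi_pinv, 2):.3f}")
```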

References

[1] S. Zhou, J. Lafferty, and L. Wasserman. Compressed regression. In Proceedings of Advances in Neural Information Processing Systems, 2007.
[2] O.A. Maillard and R. Munos. Compressed least-squares regression. In Proceedings of Advances in Neural Information Processing Systems, 2009.
[3] M.A. Davenport, M.B. Wakin, and R.G. Baraniuk. Detection and estimation with compressive measurements. Dept. of ECE, Rice University, Tech. Rep., 2006.
[4] E.J. Candès and M.B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21-30, 2008.
[5] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. The Johnson-Lindenstrauss lemma meets compressed sensing. Constructive Approximation, 2007.
[6] G.G. Lorentz, M. von Golitschek, and Y. Makovoz. Constructive Approximation: Advanced Problems, volume 304. Springer, Berlin, 1996.
[7] E.J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies. IEEE Transactions on Information Theory, 52(12):5406-5425, 2006.