ON THE OPTIMALITY OF THE BACKWARD GREEDY ALGORITHM FOR THE SUBSET SELECTION PROBLEM

Christophe Couvreur and Yoram Bresler

C. Couvreur: General Physics Department and TCTS Laboratory, Faculte Polytechnique de Mons, Belgium. Dr. C. Couvreur is also a Research Assistant of the National Fund for Scientific Research of Belgium (F.N.R.S.). Y. Bresler: Coordinated Science Laboratory and Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA.

Corresponding author: Prof. Yoram Bresler, Coordinated Science Laboratory, University of Illinois, 1308 W. Main St., Urbana, IL (USA). E-mail: ybresler@uiuc.edu. Phone: +1 (217). FAX: +1 (217).

Submitted to the SIAM Journal on Matrix Analysis and Applications. Revised: May 18, 1998.

Abstract

The following linear inverse problem is considered: given a full column rank $m \times n$ data matrix $A$ and a length-$m$ observation vector $b$, find the best least-squares solution to $Ax = b$ with at most $r < n$ nonzero components. The backward greedy algorithm computes a sparse solution to $Ax = b$ by greedily removing columns from $A$ until $r$ columns are left. A simple implementation based on a QR downdating scheme by Givens rotations is described. The backward greedy algorithm is shown to be optimal for this problem in the sense that it selects the "correct" subset of columns from $A$ if the perturbation of the data vector $b$ is small enough.

1 Introduction

The problem of computing a sparse approximate solution to a set of linear equations is important in matrix computations, where it is known as the "subset selection problem" [4]. Such sparse approximate solutions have applications in statistical regression [6], in function interpolation [10], and in signal processing [5, 2, 7]. In this paper, the following formulation of the subset selection problem is considered. Given a data matrix $A \in \mathbb{C}^{m \times n}$, $m \geq n$, and an observation vector $b \in \mathbb{C}^m$, find the best least-squares solution to $Ax = b$ with at most $r$ nonzero components. That is, find the subset of $r$ columns from $A$ that gives the best approximation of $b$ in the sense that $\|Ax - b\|_2$ is minimized over all vectors $x$ with at most $r$ nonzero components. Alternately, the problem can be viewed as estimating $x_0 \in \mathbb{C}^n$ from $b$ and the a priori knowledge that $x_0$ is sparse, where

    b = A x_0 + \eta = b_0 + \eta    (1)

and $\eta$ is an unknown noise vector. For instance, in [2], $b$ is a noisy observation of a linear mixture of vector signals. A dictionary of possible signals is available, but which subset of signals from the dictionary is present in the mixture is unknown. Their mixing proportions are also unknown. It is desired to find the signals that are effectively present in $b$ and their mixing proportions. This estimation and detection problem can be naturally formulated as a subset selection problem of the form (1) by letting the columns of $A$ and the elements of $x_0$ represent the possible signal vectors in

the dictionary and the associated mixing proportions (which are equal to zero for absent signals), respectively.

The best subset of columns from $A$ can be found by exhaustive evaluation of the least-squares residual for all possible subsets. It is possible to take advantage of the structure of the problem to reduce the size of the search space by means of branch-and-bound-type algorithms [8, 9]. However, when $n$ increases, algorithms based on exhaustive searches (even with branch-and-bound restrictions of the search space) rapidly become impractical. Therefore, several heuristics that do not require exhaustive searches have been proposed [6]. One such heuristic is the well-known "greedy" algorithm of Golub and Van Loan [4]. The greedy algorithm is a sequential forward selection scheme. The idea of the greedy algorithm is to start by finding the column of $A$ closest to $b$, and then to add columns one by one until $r$ columns have been selected, each time adding the column that gives the largest decrement of the least-squares residual. Some theoretical arguments have recently been given in favor of the greedy algorithm [10]. However, there are situations in which the greedy algorithm will fail even if there exists an exact sparse solution to $Ax = b$. Consider a simple example with $r = 2$ in which $b$ lies exactly in the span of the second and third columns of $A$, while the first column of $A$, taken alone, gives the best single-column approximation of $b$. Clearly, the greedy algorithm will start by erroneously selecting the first column of $A$ and then add the second column. In this case, the correct set of columns (the second and third) cannot be selected by the greedy algorithm. In general, it is always possible to find situations in which one or all of the heuristics that have been proposed so far will fail [6].

An alternative to the sequential forward selection used in the greedy algorithm is to use sequential backward selection, i.e., rather than adding columns to the solution one by one, it is possible to start with all columns present (i.e., with the complete matrix $A$) and remove one column at a time until $r$ columns are left. The column that is removed should be chosen to minimize the increment in the least-squares residual. We call this alternative approach the "backward greedy algorithm." To avoid confusion, the standard greedy algorithm will be referred to as the "forward

greedy algorithm" in the sequel. On the simple example above, it is obvious that the backward greedy algorithm will correctly remove the first column from $A$. In this case, the backward greedy algorithm correctly yields the optimal solution. More generally, as will be shown in Section 3, the backward greedy algorithm is guaranteed to yield the correct subset of nonzero components in $x$ if the perturbation $\eta$ in (1) is small enough.

In the following section, the backward greedy algorithm for subset selection is formally defined and a simple QR-based implementation is described. Section 3 presents our main result about the backward greedy algorithm in the form of a theorem proving that it always finds the correct subset of nonzero components of $x$ in the small perturbation case. Some implications of this theorem are then discussed. Numerical results illustrating the properties of the backward greedy algorithm are given in Section 4. The paper is concluded by some remarks on the choice of the number of columns $r$ and on the NP-hardness of the subset selection problem, and by a discussion of some possible extensions of our result.

2 The Backward Greedy Algorithm

Formally, the backward greedy algorithm can be defined as follows. Let $\Omega = \{1, \ldots, n\}$ denote the ordered set of column indices of $A$, and let $\Gamma = \{\gamma_1, \ldots, \gamma_{c(\Gamma)}\}$, $\Gamma \subseteq \Omega$, be an ordered subset of $\Omega$ of cardinality $c(\Gamma)$. The "colon" notation $A(:, \Gamma)$ is used to designate the matrix formed from the columns of $A$ whose indices are in $\Gamma$. Similarly, $x(\Gamma)$ designates the $\Gamma$-indexed elements of the vector $x$, and $x(i:j)$ designates the sub-vector with elements indexed by $i$ through $j$. Denote by

    \rho(\Gamma) = \min_{z \in \mathbb{C}^{c(\Gamma)}} \|A(:, \Gamma) z - b\|_2    (2)

the least-squares residual associated with the sparse LS solution of $Ax = b$ based on the $\Gamma$-indexed subset of columns of $A$. The backward greedy algorithm for subset selection is initialized by taking $\Gamma = \Omega$. Elements are then removed from $\Gamma$ one by one by repeating the iteration

    \Gamma \leftarrow \Gamma \setminus \{k^*\}, \qquad k^* = \arg\min_{k \in \Gamma} \rho(\Gamma \setminus \{k\})    (3)

until $c(\Gamma) = r$. The column $k^*$ that is removed at each iteration is chosen to minimize the increment in the least-squares residual. Once the last iteration has been performed, the sparse least-squares

solution associated with the column indices left in $\Gamma$ is computed. The subset of indices obtained at the last iteration of the backward greedy algorithm and the associated sparse least-squares solution will be denoted by $\hat{\Gamma}_s$ and $\hat{x}_s$, respectively.

The forward greedy algorithm is usually implemented by means of the QR algorithm for least-squares solution of linear systems. In essence, the forward greedy algorithm is a QR factorization of $A$ by the Gram-Schmidt procedure in which the column pivot is chosen greedily with respect to the right-hand side $b$ of the matrix equation $Ax = b$ (see [10, 4]). Similarly, the backward greedy algorithm can be efficiently implemented by combining the QR algorithm for least-squares problems with a column-deletion QR downdating step based on Givens rotations.

For the implementation of the backward greedy algorithm, it is necessary to evaluate the least-squares residuals associated with the matrices obtained by gradual removal of columns from $A$. Consider first the initial step of the backward greedy algorithm. Let us assume that the QR factorization of $A$ is available: $Q^H A = R$, where $Q^H$ denotes the Hermitian transpose of $Q$. The associated least-squares residual is given by $\rho(\Omega) = \|e(n+1:m)\|_2$, with $e = Q^H b$. Since $A(:, \Omega \setminus \{k\})$ is the matrix obtained by deleting the $k$-th column from $A(:, \Omega)$, its QR factorization $(Q^{(k)})^H A(:, \Omega \setminus \{k\}) = R^{(k)}$ can be computed by "downdating" $Q$ and $R$ for the column deletion by a sequence of Givens rotations. The downdating operation is [4, p. 595]

    G_{n-1}^H \cdots G_k^H R(:, \Omega \setminus \{k\}) = R^{(k)}, \qquad G_{n-1}^H \cdots G_k^H e = e^{(k)},

where $G_i$ is a rotation in planes $i$, $i+1$ for $i = k:n-1$, and $R(:, \Omega \setminus \{k\})$ is the upper Hessenberg matrix obtained by deleting the $k$-th column of $R$. The downdated least-squares residual is given by

    \rho(\Omega \setminus \{k\}) = \|e^{(k)}(n:m)\|_2.

Note that there is no need to explicitly compute the downdated QR factorization $Q^{(k)} R^{(k)}$. It is simply necessary to apply the sequence of Givens rotations to $e$ to find the downdated least-squares residual. Once the minimum downdated residual $\rho(\Omega \setminus \{k^*\})$ has been found, the $k^*$-th column is removed from $A$ and the process is repeated recursively until only $r$ columns are left. The sparse least-squares solution

$\hat{x}_s$ can then be computed by the usual QR approach. The backward greedy algorithm can thus be written in pseudo-code as follows.

Algorithm Backward Greedy
Input: matrix $A$, column vector $b$, integer $r < n$.
Output: index set $\hat{\Gamma}_s$, column vector $\hat{x}_s$.

Subset selection step:
  $i \leftarrow n$; $\Gamma \leftarrow \{1, \ldots, n\}$; compute $Q^H A = R$; $e \leftarrow Q^H b$;
  while $i > r$ do
    for $j = 1:i$ do
      compute the Givens rotations downdating $G_{i-1}^H \cdots G_j^H H^{(j)} = R^{(j)}$,
        where $H^{(j)}$ is $R$ with column $j$ deleted and $R^{(j)}$ is upper triangular;
      $e^{(j)} \leftarrow G_{i-1}^H \cdots G_j^H e$;
      $\rho(\Gamma \setminus \{\gamma_j\}) \leftarrow \|e^{(j)}(i:m)\|$, where $\gamma_j$ is the $j$-th element of $\Gamma$;
    end
    choose $1 \leq k \leq i$ such that the residual $\rho(\Gamma \setminus \{\gamma_k\})$ is minimum;
    $\Gamma \leftarrow \Gamma \setminus \{\gamma_k\}$; $R \leftarrow R^{(k)}$; $e \leftarrow e^{(k)}$; $i \leftarrow i - 1$;
  end.

Solution step:
  $\hat{\Gamma}_s \leftarrow \Gamma$; compute the solution $z$ of $R z = e(1:r)$ by back-substitution;
  $\hat{x}_s(\Gamma) \leftarrow z$; $\hat{x}_s(\Gamma^c) \leftarrow 0$.

In general, the global cost of the backward greedy algorithm will be dominated by that of

the original QR factorization and the QR downdating operations during the subset selection step. The original QR factorization of $A$ has a cost of $O(mn^2)$. Downdating a QR factorization (i.e., computing the Givens rotations and applying them to $e$) at step $i$ requires $O(i^2)$ operations (recall that $i$ runs from $n$ down to $r$) [4]. The complete $i$-th selection iteration will therefore require $O(i^3)$ operations for the $i$ QR downdating operations, plus $O(i(m-i))$ operations for the evaluation of the $i$ least-squares residuals. Once the last selection iteration has been performed, computing $\hat{x}_s$ still requires $O(r^2)$ operations. The implementation of the algorithm presented in this section is reasonably efficient, but it is not optimized. Its optimization would be specific to particular constraints (number of operations, storage requirements, numerical stability) and is therefore not attempted in this paper. See [1] and the references therein for details on the numerical implementation of QR downdating algorithms for least-squares solutions.

We conclude this section with a short discussion of the comparative computational costs of the backward greedy algorithm and the forward greedy algorithm. The backward greedy algorithm starts with $n$ columns and removes them one by one until only $r$ columns are left. The first step of the algorithm is the solution of the non-sparse least-squares problem $Ax = b$ by QR factorization. Recalling the analogy between the forward greedy algorithm and QR factorization with column pivoting, this first step of the backward greedy algorithm can be seen to be equivalent to the final step of the forward greedy algorithm. Obviously, if $r$ is small with respect to $n$, the computational cost of the backward greedy algorithm will far exceed that of the forward greedy algorithm. Because of the overhead of the forward greedy algorithm compared to a "regular" QR factorization, the cost of the backward algorithm might be similar to or even slightly lower than that of the forward algorithm if $r$ is close to $n$. Even if $r$ is not too close to $n$, it may still be interesting to use the backward algorithm rather than the forward algorithm because it possesses some nice optimality properties, which are the subject of the next section. In contrast, the forward greedy algorithm has no such properties, as demonstrated in Section 1.
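To make the selection rule (3) concrete, the following Python/NumPy sketch implements the backward greedy algorithm. For clarity it recomputes each candidate residual with a dense least-squares solve rather than the Givens-rotation QR downdating described above, so it is slower than the implementation analyzed in this section; the function names (backward_greedy, residual) are ours, and the numerical example at the end uses illustrative numbers of our own choosing (the paper's original example did not survive transcription), built so that the first column alone best approximates $b$ while $b$ lies exactly in the span of columns 2 and 3.

```python
import numpy as np

def residual(A, b, cols):
    """Least-squares residual rho(Gamma) = min_z ||A[:, cols] z - b||_2."""
    z, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
    return np.linalg.norm(A[:, cols] @ z - b)

def backward_greedy(A, b, r):
    """Start from all columns and repeatedly delete the column whose removal
    increases the least-squares residual the least, until r columns remain."""
    m, n = A.shape
    cols = list(range(n))
    while len(cols) > r:
        # Residual obtained by deleting each remaining column in turn.
        cand = [residual(A, b, cols[:j] + cols[j + 1:]) for j in range(len(cols))]
        del cols[int(np.argmin(cand))]
    # Sparse least-squares solution supported on the selected columns.
    z, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
    x = np.zeros(n, dtype=A.dtype)
    x[cols] = z
    return sorted(cols), x

# Illustrative example (our own numbers): b is exactly a2 + a3, yet a1 alone is
# the best single-column approximation, so forward selection would start wrong.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.1, 0.0, 0.0]])
b = A @ np.array([0.0, 1.0, 1.0])
print(backward_greedy(A, b, r=2))   # -> ([1, 2], array([0., 1., 1.]))
```

On this example the backward pass removes the first column at its first iteration and returns the exact 2-sparse solution, in line with the discussion in Section 1.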

3 Main Result

Consider the alternative interpretation of the subset selection problem (1). In this interpretation, there exists a sparse vector $x_0$ such that $A x_0 = b_0$, or $A(:, \Gamma_0) x_0(\Gamma_0) = b_0$ and $x_0(\Gamma_0^c) = 0$, where $\Gamma_0$ is the subset of indices of nonzero elements in $x_0$. The subset selection problem can be viewed as finding the $r$ components that are "active" in $x_0$ (i.e., finding $\Gamma_0$) and estimating their values from a noisy observation $b = b_0 + \eta$. That is, subset selection is a detection and estimation problem with a priori knowledge on the solution in the form of a sparsity constraint. The main result of this paper, presented in Theorem 1, is that, for any full rank matrix $A$ and any sparse vector $x_0$, there exists a bound on the perturbation that guarantees that the backward greedy algorithm will select the correct subset of components, i.e., $\hat{\Gamma}_s = \Gamma_0$. This means that the backward greedy algorithm is optimal for the subset selection problem at least for small "noise levels."

In order to prove this theorem, we will need the following two lemmas. As usual, we define the orthogonal distance $d(x, S)$ between a point $x$ and a subspace $S$ as the distance between this point and its orthogonal projection on the subspace. The first lemma states that the set of points located at equal distance from two subspaces (such as the ones defined by the range spaces of two possible subsets of columns of $A$) consists of the union of two subspaces. The proof of Lemma 1 is given in the Appendix.

Lemma 1. Let $S_1$ and $S_2$ be two subspaces of $\mathbb{C}^m$, $\dim(S_1) = \dim(S_2) = r$. The set of points located at equal distance from $S_1$ and $S_2$, called the bisector of $S_1$ and $S_2$ and denoted by $H$, consists of the union of two subspaces $H_1$ and $H_2$ of $\mathbb{C}^m$ of dimensions $\dim(H_1) = \dim(H_2) = m - r + \dim(S_1 \cap S_2)$.

The second lemma considers one iteration of the backward greedy algorithm, i.e., the removal of one column from the current subset of columns $\Gamma$. It states that there exists a non-trivial bound on the norm of the perturbation $\|b - b_0\|$ such that, for any perturbation smaller than the bound, one iteration of the backward greedy algorithm is guaranteed to remove a column that is not part of the correct subset of columns $\Gamma_0$, provided that the current subset of columns $\Gamma$ contains the correct subset $\Gamma_0$. Let us first define the orthogonal distance $d(x, H)$ between a point $x$ and a bisector $H$ as

$d(x, H) = \min\{d(x, H_1), d(x, H_2)\}$. That is, $d(x, H)$ is the distance between the point and the closest of its projections on the two subspaces that make up $H$.

Lemma 2. Let

    \hat{\Gamma} \in \arg\min_{\Lambda \subset \Gamma,\; c(\Lambda) = c(\Gamma) - 1} \rho(\Lambda),    (4)

where $\rho(\cdot)$ is defined in (2) and $\Gamma \subseteq \Omega$. Denote by $\Gamma_0$ the subset of indices associated with the sparse solution $x_0$ in (1). If $\Gamma \supseteq \Gamma_0$, then there exists $\delta > 0$ such that $\|b - b_0\| < \delta$, $\|b_0\| > 0$, implies $\hat{\Gamma} \supseteq \Gamma_0$. The value of $\delta$ is given by

    \delta = \min_{\Lambda \subset \Gamma,\; \Lambda \not\supseteq \Gamma_0,\; c(\Lambda) = c(\Gamma) - 1} \;\; \min_{\Lambda' \subset \Gamma,\; \Lambda' \supseteq \Gamma_0,\; c(\Lambda') = c(\Gamma) - 1} d(b_0, H(\Lambda, \Lambda')),    (5)

where $H(\Lambda, \Lambda')$ denotes the bisector of $\mathcal{R}(A(:, \Lambda))$ and $\mathcal{R}(A(:, \Lambda'))$.

Proof. Let $H(\Lambda, \Lambda')$, $\Lambda \subseteq \{1, \ldots, n\}$, $\Lambda' \subseteq \{1, \ldots, n\}$, $c(\Lambda) = c(\Lambda')$, denote the bisector of $\mathcal{R}(A(:, \Lambda))$ and $\mathcal{R}(A(:, \Lambda'))$, that is, the union of the two hyperplanes of points located at equal distance from $\mathcal{R}(A(:, \Lambda))$ and $\mathcal{R}(A(:, \Lambda'))$ as given in Lemma 1. Let $\delta$ be the radius of the largest ball $V_\delta(b_0)$ centered at $b_0 \in \mathcal{R}(A(:, \Gamma_0))$ that does not intersect any of the bisectors $H(\Lambda, \Lambda')$, $\Lambda \not\supseteq \Gamma_0$, $\Lambda \subset \Gamma$, $\Lambda' \supseteq \Gamma_0$, $\Lambda' \subset \Gamma$. That is, $\delta$ is given by (5). Clearly, all the points in $V_\delta(b_0)$ are closer to at least one of the $\mathcal{R}(A(:, \Lambda'))$ for some $\Lambda' \supseteq \Gamma_0$ than they are to any of the $\mathcal{R}(A(:, \Lambda))$ for all $\Lambda \not\supseteq \Gamma_0$. Thus, if $\|b - b_0\| < \delta$, we are guaranteed to select $\hat{\Gamma}$ among the $\Lambda'$'s, which means $\hat{\Gamma} \supseteq \Gamma_0$, as desired. All that remains to show now is that $\delta > 0$. This comes as a consequence of the linear independence of the columns of $A$. Suppose $\delta = 0$. Then there exist $\Lambda \not\supseteq \Gamma_0$ and $\Lambda' \supseteq \Gamma_0$ such that $d(b_0, H(\Lambda, \Lambda')) = 0$, i.e., $b_0 \in H(\Lambda, \Lambda')$. Because, by the definition of $H(\Lambda, \Lambda')$, $d[b_0, \mathcal{R}(A(:, \Lambda))] = d[b_0, \mathcal{R}(A(:, \Lambda'))]$ and, by assumption, $b_0 \in \mathcal{R}(A(:, \Lambda'))$, this implies $b_0 \in \mathcal{R}(A(:, \Lambda)) \cap \mathcal{R}(A(:, \Lambda'))$. Given that $\|b_0\| > 0$, this would contradict the hypothesis that $A$ is full rank, because $\Lambda \not\supseteq \Gamma_0$ while $\Lambda' \supseteq \Gamma_0$.

Equipped with Lemma 2, we can now state our main theorem.

Theorem 1. For any full column rank matrix $A$ and for any sparse vector $x_0$ with $r$ nonzero components (or, alternately, for any $b_0 = A x_0$), there exists $\delta > 0$ such that $\|b - b_0\| < \delta$ guarantees that the backward greedy algorithm for solving $Ax = b$ will select the correct subset of components

$\Gamma_0$. The value of $\delta$ is given by

    \delta = \min_{r \leq k < n} \;\; \min_{\Lambda \not\supseteq \Gamma_0,\; c(\Lambda) = k} \;\; \min_{\Lambda' \supseteq \Gamma_0,\; c(\Lambda') = k} d(b_0, H(\Lambda, \Lambda')).    (6)

Furthermore, the corresponding sparse least-squares solution $\hat{x}_s$ satisfies $\|\hat{x}_s - x_0\| \leq \delta / \sigma_{\min}[A(:, \Gamma_0)]$, where $\sigma_{\min}[A]$ denotes the smallest singular value of $A$.

Proof. The first part of the theorem can be proven by an inductive argument based on Lemma 2. Let $\Gamma^{(k)}$ denote the subset of columns of $A$ selected at step $k$ of the backward greedy algorithm, $k = n, n-1, \ldots, r$. The base for the induction is the observation that $\Gamma^{(n)} \supseteq \Gamma_0$, since $\Gamma^{(n)} = \Omega$ by definition of the backward greedy algorithm. The induction step consists in showing that if $\Gamma^{(k+1)} \supseteq \Gamma_0$, then $\Gamma^{(k)} \supseteq \Gamma_0$, provided that $\|b - b_0\| < \delta^{(k)}$ for some $\delta^{(k)} > 0$, for $k = n-1, n-2, \ldots, r$. The induction step follows directly from Lemma 2. The bound $\delta^{(k)}$ is given by (5) with $\Gamma = \Gamma^{(k+1)}$. It follows that the backward greedy algorithm will select the correct subset of columns of $A$, i.e., $\hat{\Gamma}_s = \Gamma_0$, if $\|b - b_0\| < \delta$ with $\delta = \min_{r \leq k < n} \delta^{(k)}$, the last relation yielding the bound (6). The last part of the theorem follows directly from

    \|\hat{x}_s - x_0\| = \|A(:, \Gamma_0)^\dagger (\hat{b} - b_0)\| \leq \|A(:, \Gamma_0)^\dagger\|\, \|\hat{b} - b_0\| \leq \delta / \sigma_{\min}[A(:, \Gamma_0)],

where $\hat{b} = A(:, \Gamma_0)\, \hat{x}_s(\Gamma_0)$ and $A^\dagger$ denotes the pseudo-inverse of $A$.

Our specific motivation for studying the backward greedy algorithm for subset selection arises from an estimation and detection problem in statistical signal processing [3, 2]. In this application, the vector $b$ is a random vector converging with probability one to $b_0$. We are interested in the properties of the estimates of $\Gamma_0$ and $x_0$ obtained by the backward greedy algorithm. Note that for the application described in [3, 2], finding $\Gamma_0$ is at least as important as finding $x_0$. Using Theorem 1, it is easy to establish that the estimates $\hat{\Gamma}_s$ and $\hat{x}_s$ obtained by the backward greedy algorithm are strongly consistent. That is, they converge to the true values $\Gamma_0$ and $x_0$ with probability one. To see this, note that the backward greedy algorithm is guaranteed to pick the correct subset of components if the perturbation is small enough, and recall that $\|\hat{x}_s - x_0\| \leq \|b - b_0\| / \sigma_{\min}[A(:, \Gamma_0)]$. We thus directly get the following optimal detection corollary.

Corollary 1. If $b$ is a strongly consistent estimator of $b_0$, i.e., if $b \to b_0$ w.p.1, then $\hat{\Gamma}_s \to \Gamma_0$ and $\hat{x}_s \to x_0$ w.p.1.
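As a quick numerical illustration of the error bound in Theorem 1 and of the consistency argument above (a sketch that reuses the backward_greedy function given after Section 2; the random data, seed, and perturbation size are our own choices), one can check that for a small perturbation the recovered support equals $\Gamma_0$ and the estimation error stays below $\|b - b_0\| / \sigma_{\min}[A(:, \Gamma_0)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 7))
x0 = np.zeros(7)
x0[:3] = 1.0                              # Gamma_0 = first three columns
b0 = A @ x0

eta = rng.standard_normal(10)
eta *= 1e-3 / np.linalg.norm(eta)         # small perturbation, ||eta|| = 1e-3
b = b0 + eta

gamma_hat, x_hat = backward_greedy(A, b, r=3)
sigma_min = np.linalg.svd(A[:, :3], compute_uv=False)[-1]

print(gamma_hat == [0, 1, 2])             # correct support recovered (0-based indices)
print(np.linalg.norm(x_hat - x0) <= np.linalg.norm(b - b0) / sigma_min)
```

For perturbations this small relative to $\|b_0\|$, both checks are expected to print True; as $\|\eta\|$ grows past the bound $\delta$ of (6), the first check starts to fail, which is exactly the behavior explored in Section 4.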

So far it has been assumed that a "true" sparse solution $x_0$ existed and that the subset selection problem consisted in finding this sparse solution. This "detection and estimation" view led to Theorem 1 and its first corollary. There are, however, situations in which this formulation is not adequate: there is no underlying "true" solution, and the subset selection problem is simply that of finding the subset of $r$ columns from $A$ such that the least-squares residual is minimized. In this case, let $\Gamma_s$ and $x_s$ be defined as the optimal sparse solution. That is, let

    \Gamma_s = \arg\min_{\Gamma,\; c(\Gamma) = r} \rho(\Gamma),    (7)

    x_s = \arg\min_{x \in \mathbb{C}^n,\; x(\Gamma_s^c) = 0} \|Ax - b\|.    (8)

In practice, $\Gamma_s$ and $x_s$ can always be found by exhaustive search. Recall that $\hat{\Gamma}_s$ and $\hat{x}_s$ are the solutions provided by the backward greedy algorithm. The following corollary of Theorem 1 implies that, provided the residual $\rho(\hat{\Gamma}_s)$ is small enough, the backward greedy algorithm will give the same solution as an exhaustive search.

Corollary 2. The backward greedy algorithm is optimal, i.e., $\hat{\Gamma}_s = \Gamma_s$ and $\hat{x}_s = x_s$, provided that $\rho(\hat{\Gamma}_s)$ is small enough; that is, provided that $\rho(\hat{\Gamma}_s) < \delta$, where $\delta$ is defined by (6) in which $b_0 = A \hat{x}_s$ and $\Gamma_0 = \hat{\Gamma}_s$.

Proof. By hypothesis, $\|b - A \hat{x}_s\| < \delta$. From the definition of $\delta$, this implies that $b$ is closer to $\mathcal{R}(A(:, \hat{\Gamma}_s))$ than to $\mathcal{R}(A(:, \Gamma))$ for any subset $\Gamma$ with $r$ or fewer components, since $b$ is on the "$\hat{\Gamma}_s$-side" of all the bisectors $H(\hat{\Gamma}_s, \Gamma)$. Hence, $d(b, \mathcal{R}(A(:, \hat{\Gamma}_s))) < d(b, \mathcal{R}(A(:, \Gamma)))$, or $\rho(\hat{\Gamma}_s) < \rho(\Gamma)$, which implies $\hat{\Gamma}_s = \Gamma_s$ and $\hat{x}_s = x_s$.

Using Corollary 2, a procedure for checking whether the output $\hat{\Gamma}_s$ of the backward greedy algorithm is the optimal sparse solution $\Gamma_s$ can be suggested: simply compare $\rho(\hat{\Gamma}_s)$ to the bound $\delta$ given by (6). If $\rho(\hat{\Gamma}_s) < \delta$, then $\hat{\Gamma}_s = \Gamma_s$. Unfortunately, computing $\delta$ from (6) with $\hat{\Gamma}_s$, $A$, and $b$ is not practical, so this procedure is of limited usefulness in general, and the implication of Corollary 2 is more of a qualitative nature: the backward greedy algorithm is optimal for the subset selection problem when its solution corresponds to a small enough residual. That is, if $\rho(\Gamma_s)$ is small enough, then $\Gamma_s$ is guaranteed to be found by the algorithm. For larger residuals, the backward

greedy algorithm is sub-optimal; it may or may not find the correct solution, depending on the values of $A$ and $b$.

4 Numerical Results

Figure 1 illustrates Theorem 1 and Corollary 2. A $10 \times 7$ matrix $A$ with real random i.i.d. Gaussian coefficients was generated,

    A = \begin{bmatrix}
     -0.54 & -0.10 &  0.44 & -0.39 & -1.71 & -1.16 & -0.06 \\
     -1.77 &  0.80 & -1.38 &  0.73 &  0.29 &  0.59 & -0.25 \\
      0.08 & -0.60 &  0.78 &  0.78 & -1.12 &  0.64 &  1.20 \\
      0.28 &  2.25 &  1.64 &  0.65 &  1.33 &  1.73 & -1.91 \\
     -1.20 & -0.15 & -0.70 &  1.41 & -0.82 &  0.59 &  0.43 \\
      1.25 &  0.12 &  0.82 &  0.79 &  0.57 & -0.79 &  0.90 \\
      1.22 &  1.19 &  1.24 &  1.21 &  0.18 &  0.36 &  0.85 \\
     -0.03 &  0.23 & -1.61 & -1.33 & -0.91 & -0.98 &  0.71 \\
      0.48 & -0.09 & -1.51 &  0.04 & -2.26 &  0.04 & -0.06 \\
      0.14 & -0.88 &  0.62 & -0.23 & -0.05 &  0.03 &  0.
    \end{bmatrix}.

The first three columns of the matrix were linearly combined to yield the vector $b_0$, i.e., $b_0 = A x_0$ with $x_0 = [1, 1, 1, 0, 0, 0, 0]^H$. One hundred random perturbation vectors $\eta$ were generated and added to $b_0$ to yield $b = b_0 + \eta$. The perturbation vectors were uniformly distributed on a hypersphere of given radius, i.e., the norm $\|\eta\|$ of the perturbation vectors was fixed. For each perturbation vector, the sparse least-squares solution to $Ax = b$ with $r = 3$ nonzero elements was computed by the backward greedy algorithm and by exhaustive search. The experiment was repeated for several values of $\|\eta\|$. Figure 1 gives the number of times the backward greedy algorithm found the true subset of nonzero components $\Gamma_0 = \{1, 2, 3\}$ as a function of the ratio $\|\eta\|/\|b_0\|$ (i.e., of the inverse of the "signal-to-noise" ratio). Figure 1 also gives the number of times the sparse solution obtained by exhaustive search ($\Gamma_s$) was equal to $\Gamma_0$ or to $\hat{\Gamma}_s$ (recall that for large perturbations, $\Gamma_s$ does not necessarily need to be equal to $\Gamma_0$). According to Theorem 1 and Corollary 2, we should expect $\hat{\Gamma}_s = \Gamma_0 = \Gamma_s$ when $\|\eta\| < \delta$. This can indeed be verified in Figure 1, where the value of $\delta$ corresponding to $A$ and $b_0$ is marked by a circle. The value of $\delta$ was computed

from (6). The distance between $b_0$ and the bisectors of the subspaces corresponding to subsets of columns of $A$ in the rightmost part of (6) can be computed by the method suggested at the end of the Appendix.

The value of $\delta$ depends on the matrix $A$ and the "exact" vector $b_0$. Two general observations can be made about $\delta$. First, it is easily verified that $\delta$ is proportional to $\|b_0\|$. Second, recall that $\delta$ can be viewed as the radius of the largest hypersphere centered on the sparse vector $b_0$ that can be fitted in the $n$-dimensional cone containing $b_0$ whose facets are the bisectors of pairs of subspaces spanned by subsets of columns of $A$ in (6). Intuitively, when the principal angles between these subspaces are small, the principal angles between the bisectors will also be small, the $n$-dimensional cone will be "narrower," and $\delta$ will be smaller. The principal angles between subspaces spanned by subsets of columns of $A$ will be small when these columns are nearly linearly dependent, i.e., when the condition number of the matrix $A$ is large. One would thus expect the ratio $\delta/\|b_0\|$ to vary inversely with the condition number of the matrix $A$. In order to verify this hypothesis, two hundred $10 \times 7$ matrices with random coefficients were generated. By manipulating the singular values of the matrices, the condition number $\kappa(A)$ was set to 5 for the first 100 matrices and to 50 for the next 100 matrices. For each matrix, the vector $b_0$ was again computed from the first three columns of $A$ as $b_0 = A [1, 1, 1, 0, 0, 0, 0]^H$, and the bound $\delta$ was evaluated from (6). The histograms of Figure 2 give the distributions of $\delta$ obtained for $\kappa(A) = 5$ and $\kappa(A) = 50$. It can be verified that $\delta/\|b_0\|$ is generally higher when the condition number of the matrix is small.

Given $A$ and $b_0$, a method for computing an estimate of the bound $\delta$ that does not require the exhaustive evaluation of the distances between $b_0$ and the bisectors in (6) can be suggested. This method can be used when the matrix $A$ has a large number of columns. Using the same approach as in the first experiment, random perturbation vectors $\eta$ can be added to $b_0$, and the backward greedy algorithm can be used to compute $\hat{\Gamma}_s$ for increasing values of $\|\eta\|$. When a perturbation vector is found such that $\hat{\Gamma}_s \neq \Gamma_0$, Theorem 1 implies that $\delta \leq \|\eta\|$. If a large number of random vectors are generated for each value of $\|\eta\|$, this Monte Carlo-type method can yield tight approximations of $\delta$. Since it is based on the backward greedy algorithm, the method can yield an estimate of $\delta$ to any desired confidence level in polynomial time.
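The Monte Carlo procedure just described can be sketched as follows (again reusing the backward_greedy function from Section 2; the function name estimate_delta, the grid of perturbation norms, and the number of trials per norm are our own choices). The returned value is the smallest tested norm $\|\eta\|$ at which some random perturbation makes the algorithm miss $\Gamma_0$, which by Theorem 1 is an upper estimate of $\delta$.

```python
import numpy as np

def estimate_delta(A, b0, gamma0, r, norms, trials=100, seed=0):
    """Monte Carlo estimate of the bound delta of Theorem 1: return the
    smallest tested ||eta|| at which backward greedy misses gamma0."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    for nrm in sorted(norms):
        for _ in range(trials):
            eta = rng.standard_normal(m)
            eta *= nrm / np.linalg.norm(eta)       # random direction, fixed norm
            gamma_hat, _ = backward_greedy(A, b0 + eta, r)
            if gamma_hat != sorted(gamma0):
                return nrm                          # a failure implies delta <= nrm
    return None                                     # no failure on the tested grid

# Example usage (data of our own choosing):
A = np.random.default_rng(1).standard_normal((10, 7))
b0 = A[:, :3].sum(axis=1)                           # Gamma_0 = first three columns
grid = np.linalg.norm(b0) * np.logspace(-3, 0, 20)  # ||eta|| from 1e-3*||b0|| to ||b0||
print(estimate_delta(A, b0, gamma0=[0, 1, 2], r=3, norms=grid))
```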

5 Remarks

5.1 Determination of the Number of Components r

In the variant of the subset selection problem considered here, the number of desired nonzero components $r$ was supposed known a priori. It is possible to consider other formulations of the problem. For example, in [10], Natarajan states the following version of the subset selection problem ("Natarajan's problem" in the sequel): given $A$, $b$, and $\varepsilon > 0$, find the vector $x$ satisfying $\|Ax - b\|_2 \leq \varepsilon$, if such a vector exists, such that $x$ has the fewest nonzero entries of all such vectors. In the version considered in [2], neither $\varepsilon$ nor $r$ is known a priori. Instead, the subset selection problem is defined as finding the sparse vector $x$ that realizes the best trade-off between the match to the observation vector $b$ and the number of nonzero components. This trade-off is measured by a figure of merit of the form

    \min_x \{\|Ax - b\|_2 + f(x)\},    (9)

where $f(x)$ is a complexity penalty monotonically increasing with the number of nonzero components in $x$. Note that because $\kappa(A(:, \Gamma_0)) < \kappa(A)$, a sparsity constraint reduces the sensitivity of the solution of the linear inverse problem $Ax = b$ to perturbations of the observation vector $b_0$, at least for small enough perturbations ($\|\eta\| < \delta$). For this reason, solving a linear inverse problem with a sparsity constraint as in (9) is sometimes known as "regularization by sparsity" [5]. The backward greedy algorithm presented in Section 2 can be readily applied to the above problems by solving for successively decreasing $r$ and looking for the sparsest solution satisfying $\|Ax - b\| \leq \varepsilon$ in the case of Natarajan's problem, or for the minimizer of the figure of merit (9) in the case of regularization by sparsity; a sketch of this procedure is given after Corollary 3 below.

5.2 NP-Hardness of Subset Selection

In [10], Natarajan considered the variant of the subset selection problem described in Section 5.1. He showed that finding the sparsest solution to $\|Ax - b\| \leq \varepsilon$ for a given $\varepsilon$, if such a solution exists, is NP-hard. This may seem to be in contradiction with our result that the backward greedy algorithm can solve the subset selection problem correctly in polynomial time in the small residual case. Indeed, for Natarajan's formulation of the subset selection problem, we have the following

corollary to our main result.

Corollary 3. Let $\rho_1$, $\Gamma_0(\varepsilon)$, and $r(\varepsilon)$ be the residual, the component indices, and their number, respectively, in the solution to the subset selection problem (Natarajan's problem) for given $A$ of full column rank, $b$, and $\varepsilon$, if the solution exists. Suppose $\rho_1 < \delta$, where $\delta$ is given by (6) with $r = r(\varepsilon)$ and $\Gamma_0 = \Gamma_0(\varepsilon)$. Then, the backward greedy algorithm will provide the optimal solution $\hat{\Gamma}_s = \Gamma_0(\varepsilon)$ if stopped at the smallest $r$ satisfying $\rho(\hat{\Gamma}_s) \leq \varepsilon$.

Proof. The proof follows directly from the formulation of the subset selection problem by Natarajan and from Corollary 2.

Corollary 3 states that if the threshold $\varepsilon$ is small enough, the solution to Natarajan's problem will be found in polynomial time by the backward greedy algorithm. The contradiction between Natarajan's result on the NP-hardness of the subset selection problem and the last statement is only apparent. Indeed, assuming that the backward greedy algorithm yields a solution $\hat{\Gamma}_s$ complying with $\|Ax - b\| \leq \varepsilon$ for a given $\varepsilon$, it would still be necessary to compute the corresponding bound $\delta$ from (6) to verify that $\varepsilon < \delta$, which is not a polynomial-time operation. Thus, even if the backward greedy algorithm yields a solution in polynomial time that can be expected to be the optimal solution if the residual is small enough, the verification that the residual is indeed "small enough" cannot be performed in polynomial time.

5.3 Extensions of the Results

The results presented in the previous sections can be extended to variants of the subset selection problem based on metrics other than the Euclidean distance. In general, the subset selection problem can be stated as finding the subset of $r$ columns from $A$ that gives the best approximation of $b$ in the sense that $d(Ax, b)$ is minimized over all vectors $x$ with at most $r$ nonzero components, for a given distance measure $d(\cdot, \cdot)$ defined on $\mathbb{C}^m$. Let the distance between a point $x$ and a set of points $S$ be defined as $d(x, S) = \min_{y \in S} d(x, y)$, and let $H(\Lambda, \Lambda') = \{x : d[x, \mathcal{R}(A(:, \Lambda))] = d[x, \mathcal{R}(A(:, \Lambda'))]\}$ denote the bisector of $\mathcal{R}(A(:, \Lambda))$ and $\mathcal{R}(A(:, \Lambda'))$. It is then easily verified that Lemma 2 and the first part of Theorem 1 remain valid mutatis mutandis.
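The stopping rule of Corollary 3, and the sweep over decreasing $r$ described in Section 5.1, can be sketched as follows (reusing the backward_greedy function from Section 2; the function name natarajan_backward is ours). Because the supports produced by backward elimination are nested, the residual can only grow as $r$ decreases, so the loop may stop at the first value of $r$ for which the threshold is violated.

```python
import numpy as np

def natarajan_backward(A, b, eps):
    """Sparsest backward-greedy solution with ||A x - b|| <= eps
    (the stopping rule of Corollary 3)."""
    n = A.shape[1]
    best = None
    for r in range(n, 0, -1):                 # successively decreasing r
        cols, x = backward_greedy(A, b, r)
        if np.linalg.norm(A @ x - b) <= eps:
            best = (cols, x)                  # still within the threshold: keep going
        else:
            break                             # residual exceeds eps: stop
    return best                               # None if even r = n is infeasible
```

The regularization-by-sparsity variant (9) is handled by the same sweep, keeping instead the value of $r$ that minimizes $\|Ax - b\|_2 + f(x)$; note also that, in practice, a single elimination pass down to $r = 1$ would provide all the nested supports at once instead of rerunning the algorithm for each $r$.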

Appendix. Proof of Lemma 1

Proof. Let $P_1$ and $P_2$ denote the projection matrices associated with $S_1$ and $S_2$, respectively. The bisector $H$ of $S_1$ and $S_2$ is defined by

    H = \{x : \|x - P_1 x\| = \|x - P_2 x\|\}.

Let $W = S_1 \cap S_2$. If $W \neq \{0\}$, it is always possible to write $S_1 = W \oplus \tilde{S}_1$ and $S_2 = W \oplus \tilde{S}_2$ with $\tilde{S}_1 \cap \tilde{S}_2 = \{0\}$ and $s = \dim(\tilde{S}_i) = \dim(S_i) - \dim(S_1 \cap S_2)$, $i = 1, 2$. Let $P_W$, $\tilde{P}_1$, and $\tilde{P}_2$ be the projection matrices associated with $W$, $\tilde{S}_1$, and $\tilde{S}_2$, respectively. Applying Pythagoras' theorem to the projections of $x$ on the subspaces $S_i$, $\tilde{S}_i$, and $W$, we have

    \|x - \tilde{P}_i x\|^2 = \|x - P_i x\|^2 + \|P_i x - \tilde{P}_i x\|^2 = \|x - P_i x\|^2 + \|P_W x\|^2

for $i = 1, 2$. Hence, the bisector of $S_1$ and $S_2$ is also the bisector of $\tilde{S}_1$ and $\tilde{S}_2$, that is,

    H = \{x : \|x - \tilde{P}_1 x\| = \|x - \tilde{P}_2 x\|\}.

By the idempotence and Hermitian symmetry properties of projection matrices, $\|x - \tilde{P}_1 x\|^2 = \|x - \tilde{P}_2 x\|^2$ is equivalent to

    x^H (\tilde{P}_1 - \tilde{P}_2) x = 0.

Let $U_i = (u_1^i, \ldots, u_s^i)$ be an $m \times s$ matrix whose columns form an orthonormal basis for $\tilde{S}_i$. We have

    0 = x^H (\tilde{P}_1 - \tilde{P}_2) x = x^H (U_1 U_1^H - U_2 U_2^H) x = x^H (U_1 - U_2)(U_1 + U_2)^H x, \qquad \forall x \in H.

Thus, the bisector $H$ consists of the union of the two subspaces defined by $(U_1 + U_2)^H x = 0$ and $(U_1 - U_2)^H x = 0$. In other terms,

    H = H_1 \cup H_2, \qquad \text{where } H_1 = \mathcal{N}((U_1 + U_2)^H), \quad H_2 = \mathcal{N}((U_1 - U_2)^H),

and where $\mathcal{N}(A)$ denotes the nullspace of $A$. It also follows directly that

    \dim[H_1] = \dim[H_2] = m - s = m - r + \dim(S_1 \cap S_2),

which concludes the proof.

The derivation above suggests a method for computing the distance $d(x, H)$ between a vector $x$ and a bisector $H$. This distance can be easily computed from the orthonormal bases $U_1$ and $U_2$ by recalling that the range $\mathcal{R}(A)$ is the orthogonal complement of $\mathcal{N}(A^H)$. It follows that $d(x, H_1) = \|P_{\mathcal{R}(U_1 + U_2)}\, x\|$, where $P_{\mathcal{R}(U_1 + U_2)}$ is the orthogonal projection onto $\mathcal{R}(U_1 + U_2)$, which can be easily computed, e.g., by a QR factorization of $U_1 + U_2$. Likewise, $d(x, H_2) = \|P_{\mathcal{R}(U_1 - U_2)}\, x\|$ can be easily computed. The orthonormal bases $U_1$ and $U_2$ can be obtained from the principal vectors between the subspaces $S_1$ and $S_2$ [4]. The latter can be obtained from any pair of orthonormal bases for $S_1$ and $S_2$ via an SVD.
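The distance computation just described admits a direct NumPy transcription (a sketch; it assumes $\tilde{S}_1$ and $\tilde{S}_2$ are already given through orthonormal bases U1 and U2 with $\tilde{S}_1 \cap \tilde{S}_2 = \{0\}$, and it does not show the extraction of U1 and U2 from general bases of $S_1$ and $S_2$ via principal vectors; the function name bisector_distance is ours):

```python
import numpy as np

def bisector_distance(x, U1, U2):
    """d(x, H) for the bisector H = H1 u H2 of span(U1) and span(U2), where
    U1 and U2 are m x s matrices with orthonormal columns.  Since
    H1 = null((U1 + U2)^H), d(x, H1) = ||P_range(U1 + U2) x||, and similarly
    for H2 with U1 - U2; d(x, H) is the smaller of the two."""
    dists = []
    for M in (U1 + U2, U1 - U2):
        # Orthonormal basis of range(M); the SVD guards against rank deficiency.
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        Q = U[:, s > 1e-12 * s.max()]
        dists.append(np.linalg.norm(Q.conj().T @ x))   # ||Q Q^H x|| = ||Q^H x||
    return min(dists)
```

Evaluating the bound $\delta$ of (6) then amounts to looping this distance over all admissible pairs of column subsets, which is practical only for modest $n$; for larger problems the Monte Carlo estimate sketched at the end of Section 4 is the more workable route.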

References

[1] A. Bjorck, H. Park, and L. Elden, Accurate downdating of least squares solutions, SIAM Journal on Matrix Analysis and Applications, 15 (1994), pp. 549-568.

[2] C. Couvreur and Y. Bresler, Dictionary-based decomposition of linear mixtures of Gaussian processes, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, May 1996, pp. 2519-2522.

[3] C. Couvreur and Y. Bresler, Optimal decomposition and classification of linear mixtures of ARMA processes, submitted. Available at ftp://thor.fpms.ac.be/pub/couvreur/biblio/sarma.ps.gz.

[4] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, second ed., 1989.

[5] G. Harikumar and Y. Bresler, A new algorithm for computing sparse solutions to linear inverse problems, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, Atlanta, GA, May 1996, pp. 1331-1334.

[6] A. J. Miller, Subset Selection in Regression, Chapman and Hall, London, UK, 1990.

[7] M. Nafie, M. Ali, and A. H. Tewfik, Optimal subset selection for adaptive signal representation, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, Atlanta, GA, May 1996.

[8] G. M. Furnival and R. W. Wilson, Jr., Regression by leaps and bounds, Technometrics, 16 (1974), pp. 499-511.

[9] A. H. Feiveson, Finding the best regression subset by reduction in nonfull-rank cases, SIAM Journal on Matrix Analysis and Applications, 15 (1994), pp. 194-204.

[10] B. K. Natarajan, Sparse approximate solutions to linear systems, SIAM Journal on Computing, 24 (1995), pp. 227-234.

Figure 1: Performance of the backward greedy algorithm as a function of the ratio $\|b - b_0\|/\|b_0\|$ (percentage of trials in which Backward = True, Backward = Exhaustive, and Exhaustive = True). The value of the bound $\delta/\|b_0\|$ is represented by a circle.

Figure 2: Distribution of the bound $\delta$ for $10 \times 7$ random matrices with given condition number: (a) $\kappa(A) = 5$ and (b) $\kappa(A) = 50$. (Histograms of $\delta/\|b_0\|$, in percent of the matrices, with $\Gamma_0 = \{1, 2, 3\}$ in both cases.)


More information

MAT Linear Algebra Collection of sample exams

MAT Linear Algebra Collection of sample exams MAT 342 - Linear Algebra Collection of sample exams A-x. (0 pts Give the precise definition of the row echelon form. 2. ( 0 pts After performing row reductions on the augmented matrix for a certain system

More information

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination Math 0, Winter 07 Final Exam Review Chapter. Matrices and Gaussian Elimination { x + x =,. Different forms of a system of linear equations. Example: The x + 4x = 4. [ ] [ ] [ ] vector form (or the column

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

2 Garrett: `A Good Spectral Theorem' 1. von Neumann algebras, density theorem The commutant of a subring S of a ring R is S 0 = fr 2 R : rs = sr; 8s 2

2 Garrett: `A Good Spectral Theorem' 1. von Neumann algebras, density theorem The commutant of a subring S of a ring R is S 0 = fr 2 R : rs = sr; 8s 2 1 A Good Spectral Theorem c1996, Paul Garrett, garrett@math.umn.edu version February 12, 1996 1 Measurable Hilbert bundles Measurable Banach bundles Direct integrals of Hilbert spaces Trivializing Hilbert

More information

8. Prime Factorization and Primary Decompositions

8. Prime Factorization and Primary Decompositions 70 Andreas Gathmann 8. Prime Factorization and Primary Decompositions 13 When it comes to actual computations, Euclidean domains (or more generally principal ideal domains) are probably the nicest rings

More information

VII Selected Topics. 28 Matrix Operations

VII Selected Topics. 28 Matrix Operations VII Selected Topics Matrix Operations Linear Programming Number Theoretic Algorithms Polynomials and the FFT Approximation Algorithms 28 Matrix Operations We focus on how to multiply matrices and solve

More information

4.3 - Linear Combinations and Independence of Vectors

4.3 - Linear Combinations and Independence of Vectors - Linear Combinations and Independence of Vectors De nitions, Theorems, and Examples De nition 1 A vector v in a vector space V is called a linear combination of the vectors u 1, u,,u k in V if v can be

More information

Matrix decompositions

Matrix decompositions Matrix decompositions Zdeněk Dvořák May 19, 2015 Lemma 1 (Schur decomposition). If A is a symmetric real matrix, then there exists an orthogonal matrix Q and a diagonal matrix D such that A = QDQ T. The

More information

LECTURE 7. Least Squares and Variants. Optimization Models EE 127 / EE 227AT. Outline. Least Squares. Notes. Notes. Notes. Notes.

LECTURE 7. Least Squares and Variants. Optimization Models EE 127 / EE 227AT. Outline. Least Squares. Notes. Notes. Notes. Notes. Optimization Models EE 127 / EE 227AT Laurent El Ghaoui EECS department UC Berkeley Spring 2015 Sp 15 1 / 23 LECTURE 7 Least Squares and Variants If others would but reflect on mathematical truths as deeply

More information

Class notes: Approximation

Class notes: Approximation Class notes: Approximation Introduction Vector spaces, linear independence, subspace The goal of Numerical Analysis is to compute approximations We want to approximate eg numbers in R or C vectors in R

More information

An Introduction to Sparse Approximation

An Introduction to Sparse Approximation An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Math 341: Convex Geometry. Xi Chen

Math 341: Convex Geometry. Xi Chen Math 341: Convex Geometry Xi Chen 479 Central Academic Building, University of Alberta, Edmonton, Alberta T6G 2G1, CANADA E-mail address: xichen@math.ualberta.ca CHAPTER 1 Basics 1. Euclidean Geometry

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Computational Methods. Eigenvalues and Singular Values

Computational Methods. Eigenvalues and Singular Values Computational Methods Eigenvalues and Singular Values Manfred Huber 2010 1 Eigenvalues and Singular Values Eigenvalues and singular values describe important aspects of transformations and of data relations

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Compressed Sensing and Robust Recovery of Low Rank Matrices

Compressed Sensing and Robust Recovery of Low Rank Matrices Compressed Sensing and Robust Recovery of Low Rank Matrices M. Fazel, E. Candès, B. Recht, P. Parrilo Electrical Engineering, University of Washington Applied and Computational Mathematics Dept., Caltech

More information

5 and A,1 = B = is obtained by interchanging the rst two rows of A. Write down the inverse of B.

5 and A,1 = B = is obtained by interchanging the rst two rows of A. Write down the inverse of B. EE { QUESTION LIST EE KUMAR Spring (we will use the abbreviation QL to refer to problems on this list the list includes questions from prior midterm and nal exams) VECTORS AND MATRICES. Pages - of the

More information

Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images

Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images Alfredo Nava-Tudela ant@umd.edu John J. Benedetto Department of Mathematics jjb@umd.edu Abstract In this project we are

More information

then kaxk 1 = j a ij x j j ja ij jjx j j: Changing the order of summation, we can separate the summands, kaxk 1 ja ij jjx j j: let then c = max 1jn ja

then kaxk 1 = j a ij x j j ja ij jjx j j: Changing the order of summation, we can separate the summands, kaxk 1 ja ij jjx j j: let then c = max 1jn ja Homework Haimanot Kassa, Jeremy Morris & Isaac Ben Jeppsen October 7, 004 Exercise 1 : We can say that kxk = kx y + yk And likewise So we get kxk kx yk + kyk kxk kyk kx yk kyk = ky x + xk kyk ky xk + kxk

More information

Sparse analysis Lecture II: Hardness results for sparse approximation problems

Sparse analysis Lecture II: Hardness results for sparse approximation problems Sparse analysis Lecture II: Hardness results for sparse approximation problems Anna C. Gilbert Department of Mathematics University of Michigan Sparse Problems Exact. Given a vector x R d and a complete

More information

Math Linear Algebra II. 1. Inner Products and Norms

Math Linear Algebra II. 1. Inner Products and Norms Math 342 - Linear Algebra II Notes 1. Inner Products and Norms One knows from a basic introduction to vectors in R n Math 254 at OSU) that the length of a vector x = x 1 x 2... x n ) T R n, denoted x,

More information

Contents. 4 Arithmetic and Unique Factorization in Integral Domains. 4.1 Euclidean Domains and Principal Ideal Domains

Contents. 4 Arithmetic and Unique Factorization in Integral Domains. 4.1 Euclidean Domains and Principal Ideal Domains Ring Theory (part 4): Arithmetic and Unique Factorization in Integral Domains (by Evan Dummit, 018, v. 1.00) Contents 4 Arithmetic and Unique Factorization in Integral Domains 1 4.1 Euclidean Domains and

More information

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products

Linear Algebra. Paul Yiu. Department of Mathematics Florida Atlantic University. Fall A: Inner products Linear Algebra Paul Yiu Department of Mathematics Florida Atlantic University Fall 2011 6A: Inner products In this chapter, the field F = R or C. We regard F equipped with a conjugation χ : F F. If F =

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

arxiv: v1 [math.pr] 22 May 2008

arxiv: v1 [math.pr] 22 May 2008 THE LEAST SINGULAR VALUE OF A RANDOM SQUARE MATRIX IS O(n 1/2 ) arxiv:0805.3407v1 [math.pr] 22 May 2008 MARK RUDELSON AND ROMAN VERSHYNIN Abstract. Let A be a matrix whose entries are real i.i.d. centered

More information

G1110 & 852G1 Numerical Linear Algebra

G1110 & 852G1 Numerical Linear Algebra The University of Sussex Department of Mathematics G & 85G Numerical Linear Algebra Lecture Notes Autumn Term Kerstin Hesse (w aw S w a w w (w aw H(wa = (w aw + w Figure : Geometric explanation of the

More information

LINEAR SYSTEMS (11) Intensive Computation

LINEAR SYSTEMS (11) Intensive Computation LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY

More information