Online Adaptive Estimation of Sparse Signals: where RLS meets the ℓ1-norm


IEEE TRANSACTIONS ON SIGNAL PROCESSING (TO APPEAR)

Online Adaptive Estimation of Sparse Signals: where RLS meets the ℓ1-norm

Daniele Angelosante, Member, IEEE, Juan-Andres Bazerque, Student Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE

Manuscript submitted September 2, 2009; revised January 20, 2010; accepted March 9, 2010. Parts of this paper were presented at the IEEE Intl. Conf. on Acoust., Speech and Signal Proc., Taipei, Taiwan, April 2009, and at the IEEE Workshop on Stat. Signal Proc., Cardiff, Wales, UK, August 2009. The authors are with the Dept. of Electrical and Computer Engr. at the Univ. of Minnesota, Minneapolis, MN 55455, USA (e-mails: {angel29, bazer002, georgios}@umn.edu).

Abstract—Using the ℓ1-norm to regularize the least-squares criterion, the batch least-absolute shrinkage and selection operator (Lasso) has well-documented merits for estimating sparse signals of interest emerging in various applications where observations adhere to parsimonious linear regression models. To cope with the high complexity, increasing memory requirements, and lack of tracking capability that batch Lasso estimators face when processing observations sequentially, the present paper develops a novel time-weighted Lasso (TWL) approach. Performance analysis reveals that TWL cannot estimate consistently the desired signal support without compromising its rate of convergence. This motivates the development of a time- and norm-weighted Lasso (TNWL) scheme with ℓ1-norm weights obtained from the recursive least-squares (RLS) algorithm. The resultant algorithm consistently estimates the support of sparse signals without reducing the convergence rate. To cope with sparsity-aware recursive real-time processing, novel adaptive algorithms are also developed to enable online coordinate descent solvers of TWL and TNWL that provably converge to the true sparse signal in the time-invariant case. Simulated tests compare competing alternatives and corroborate the performance of the novel algorithms in estimating time-invariant signals, and in tracking time-varying signals under sparsity constraints.

I. INTRODUCTION

Sparsity is an attribute present in a plethora of natural as well as man-made signals and systems. This is reasonable not only because nature itself is parsimonious, but also because simple models with minimal degrees of freedom, and the processing they entail, are attractive from an implementation perspective. Exploitation of sparsity is critical in applications as diverse as variable selection in linear regression models for diabetes [24], image compression [5], distributed spectrum sensing for cognitive radios [3], estimation of wireless multipath channels [9], [23], and signal decomposition using overcomplete bases [8].

The notion of variable selection (VS) associated with sparse linear regression [24] is the cornerstone of the emerging area of compressive sampling (CS) [4], [5], [8]. VS is a combinatorially complex task closely related (but not identical) to the well-known model order selection problem tackled through Akaike's information [1], Bayesian information [20], and risk inflation [11] criteria. A typically convex function of the model fitting error is penalized with the ℓ0-norm of the unknown vector, which equals the number of nonzero entries and thus accounts for model complexity (degrees of freedom). To bypass the non-convexity of the ℓ0-norm, VS and CS approaches replace it with convex penalty terms (e.g., the ℓ1-norm) that capture sparsity but also lead to computationally efficient solvers.

Research on CS and VS has concentrated on batch processing, and various algorithms for sparse linear regression are available. These include basis pursuit and the Lasso [8], [24], the Dantzig selector [6], and the ℓ2-norm constrained ℓ1-norm minimizer [4]. CS and VS estimators are nonlinear functions of the available observations, which they process in batch form using iterative algorithms. However, many sparse signals encountered in practice must be estimated based on noisy observations that become available sequentially in time. For such cases, batch signal estimators typically incur complexity and memory requirements that grow as time progresses. In addition, the sparse signal may vary with time, both in its nonzero support and in the values of its nonzero entries. To cope with these challenges, the present paper develops adaptive algorithms for recursive estimation and tracking of (possibly time-varying) sparse signals based on noisy sequential observations adhering to a linear regression model.

Adaptive estimation of sparse signals has received little consideration so far. Sequential noise-free signal recovery was considered in [18], and a sparsity-aware least mean-square (LMS) algorithm was pursued in [14]. Sparsity-aware RLS-like algorithms are reported in [2], while an effort to combine Kalman filtering and compressed sensing can be found in [26].

After preliminaries, the problem is stated in Section II. Section III deals with two pseudo-real-time adaptive Lasso algorithms: the time-weighted Lasso (TWL), and the time- and norm-weighted Lasso (TNWL). The novel TWL replaces the ℓ2-norm regularizing term of the RLS algorithm with the ℓ1-norm that encourages sparsity. It turns out that if the sparse signal vector is time-invariant, the TWL cannot estimate the signal support consistently while at the same time ensuring a convergence rate comparable to RLS. To overcome this shortcoming, the TNWL introduces a weighted ℓ1-norm regularization, where the weights are obtained from the RLS algorithm. TNWL estimates consistently the support of the sparse signal vector at a convergence rate identical to that of RLS. Since TWL and TNWL estimates are not available in closed form, a sequence of convex programs has to be solved as new data are acquired, which is not desirable for real-time applications. Recent advances in sparse linear regression have shown that the coordinate descent approach provides an efficient means of solving Lasso-like problems, since it is fast and numerically stable [13], [27].

This motivates the present paper's novel, low-complexity online coordinate descent algorithms developed in Section IV. Partial coordinate-wise updates have been considered for adaptive but sparsity-agnostic processing to lower the computational complexity of LMS and RLS [12], [28]. On top of lowering the computational burden, online coordinate descent is well motivated for sparsity-aware adaptive processing because the scalar coordinate estimates become available in closed form. Corroborating simulations are presented in Section V, both for time-invariant and time-varying sparse signals. Conclusions are drawn in Section VI.

Notation. Column vectors (matrices) are denoted with lower-case (upper-case) boldface letters, and sets with calligraphic letters; (·)^T stands for transposition, and (·)^† for the Moore-Penrose pseudo-inverse. The pth entry of vector x is denoted as x(p), and the (p,q)th entry of matrix R as R(p,q). The function N(µ, σ²) stands for the Gaussian probability density function with mean µ and variance σ²; the ℓ1- and ℓ2-norms of x ∈ R^P are denoted as ||x||_1 := Σ_{p=1}^P |x(p)| and ||x||_2 := (Σ_{p=1}^P x(p)²)^{1/2}, respectively.

II. PRELIMINARIES AND PROBLEM STATEMENT

Consider a vector x_o ∈ R^P which is sparse, meaning that only a few of its entries x_o(p), p = 1,...,P, are nonzero. Let S_{x_o} := {p : x_o(p) ≠ 0} denote its support, P_1 := |S_{x_o}| the number of nonzero entries, and P_0 := P − P_1. Sparsity amounts to having P_1 ≪ P_0. Suppose that such a sparse vector is to be estimated sequentially in time from scalar observations obeying the linear regression model

y_n := h_n^T x_o + v_n,  n = 1,...,N   (1)

where h_n ∈ R^P is the regression vector at time n, and the additive noise v_n is assumed uncorrelated with h_n, white, with mean zero, and variance σ². The goal of this paper is to develop sequential and adaptive estimators of x_o, which is a priori known to be sparse, and perhaps slowly varying with n.

The least-squares (LS) criterion is the workhorse for linear regression analysis [19, p. 658]. If y_N := [y_1,...,y_N]^T and H_N := [h_1,...,h_N]^T, the LS estimator of x_o at time N solves the minimization problem

x̂_N^LS := arg min_{x ∈ R^P} ||y_N − H_N x||_2².   (2)

If N < P or H_N is not full column rank, the problem in (2) does not admit a unique solution. For such cases, minimizing also the ℓ2-norm of x renders the LS solver unique and expressible as x̂_N^LS = H_N^† y_N, where † denotes the matrix pseudo-inverse defined as in, e.g., [15, p. 275]. In the sequential context considered herein, LS faces three challenges: i) increasing memory requirements for storing y_N and H_N as N grows large; ii) complexity of order O(P³) per time instant n to perform the inversion in H_N^†; and iii) lack of capability to track possible variations of x_o with n. These challenges are met by the recursive least-squares (RLS) estimator obtained as [19, Chap. 2]

x̂_N^RLS := arg min_{x ∈ R^P} Σ_{n=1}^N β_{N,n} (y_n − h_n^T x)²,  N = 1, 2,...   (3)

where the so-called forgetting factor β_{N,n} describes one of the following data windowing choices:

(w1) Infinite window with β_{N,n} = 1. This choice is adopted for time-invariant signals and, with proper initialization, renders RLS equivalent to LS at complexity O(P²) per datum.

(w2) Exponentially decaying window with β_{N,n} = β^{N−n}, and 0 ≤ β < 1. With this choice, RLS downweighs old samples and can track time-varying signals.

(w3) Finite window with β_{N,n} = 1 if N − M + 1 ≤ n ≤ N, and β_{N,n} = 0 otherwise. Here, only the most recent M samples are utilized to form x̂_N^RLS, while the rest are discarded.

The estimator in (3) can be expressed recursively in terms of x̂_{N−1}^RLS. Supposing that {h_n}_{n=1}^P are linearly independent, setting β_{N,n} = 1, and initializing this recursion with the LS solution for N = P, that is x̂_P^RLS = x̂_P^LS, the RLS coincides with the LS for successive time instants N > P, provided that x_o remains time-invariant [19, p. 740]. For N < P, or when {h_n}_{n=1}^P are linearly dependent, the RLS estimator can be regularized by augmenting the LS cost with a scaled ℓ2-norm of x [19, p. 739]. Specifically, the regularized RLS is

x̂_N^RLS := arg min_{x ∈ R^P} [ Σ_{n=1}^N β_{N,n} (y_n − h_n^T x)² + γ_N ||x||_2² ],  N = 1, 2,...   (4)

where γ_N > 0 is a pre-selected decreasing function of N that depends on the selected window, and whose effect vanishes for N large. Clearly, for γ_N = 0 the regularized RLS in (4) reduces to the ordinary one in (3), but neither exploits the sparsity present in x_o.

Sparse linear regression has been a topic of intense research in the last decade, and the Lasso is one of the most widely applied sparsity-aware estimators [5], [24]. The Lasso estimator is given by

x̂_N^Lasso := arg min_{x ∈ R^P} [ (1/2) ||y_N − H_N x||_2² + λ_N ||x||_1 ].   (5)

Thanks to the scaled ℓ1-norm, the cost encourages sparse solutions [24]: the larger the chosen λ_N is, the more components are shrunk to zero. Interestingly, the Lasso performs well in sparse problems also when N < P, and the convex ℓ1-norm regularization, which affords efficient solvers when optimizing (5) given batch data, performs similarly to its non-convex ℓ0-norm counterpart [4].

The question that arises is how the ℓ1-norm regularization can be effectively utilized in adaptive signal processing. Specifically, given {y_n, h_n}_{n=1}^N, we wish to develop recursive schemes to estimate the sparse signal of interest with: i) minimal memory requirements; ii) tracking capability; and iii) limited complexity.
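To make the setup of (1)-(5) concrete, the short sketch below simulates the sparse linear model (1) and contrasts the sparsity-agnostic regularized LS of (4) with the batch Lasso of (5), the latter solved here by a few passes of the cyclic coordinate descent discussed later in Section IV. This is a minimal illustration under assumed parameter values (P, P_1, σ², γ, and the λ choice are picked only for the example), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse linear regression model (1): y_n = h_n^T x_o + v_n  (illustrative values).
P, P1, N, sigma2 = 30, 3, 200, 1e-1
x_o = np.zeros(P)
x_o[:P1] = 1.0                               # first P1 entries nonzero, rest zero
H = rng.standard_normal((N, P))              # regressors h_n ~ N(0_P, I_P), stacked row-wise
y = H @ x_o + np.sqrt(sigma2) * rng.standard_normal(N)

# Sparsity-agnostic regularized LS, cf. (4): minimize ||y - Hx||_2^2 + gamma ||x||_2^2.
gamma = 1e-2
x_ls = np.linalg.solve(H.T @ H + gamma * np.eye(P), H.T @ y)

# Batch Lasso (5): minimize (1/2)||y - Hx||_2^2 + lam ||x||_1, solved by a fixed number
# of cyclic coordinate descent passes (soft thresholding per coordinate).
lam = 2 * np.sqrt(sigma2 * 2 * N * np.log(P))    # assumed scale, in the spirit of Sec. III-C
x_lasso = np.zeros(P)
R, r = H.T @ H, H.T @ y
for _ in range(100):
    for p in range(P):
        rp = r[p] - R[p] @ x_lasso + R[p, p] * x_lasso[p]
        x_lasso[p] = np.sign(rp) * max(abs(rp) - lam, 0.0) / R[p, p]

print("nonzeros (regularized LS):", np.sum(np.abs(x_ls) > 1e-3))
print("nonzeros (Lasso)         :", np.sum(np.abs(x_lasso) > 1e-3))
```

Even in this toy setting, the ℓ1 penalty returns exact zeros for the inactive entries, whereas the ℓ2-regularized LS does not.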

III. ADAPTIVE PSEUDO-REAL-TIME LASSO

Motivated by (3), a time-weighted Lasso (TWL) approach emerges naturally to endow the batch Lasso in (5) with the ability to handle sequential processing. Specifically, the proposed TWL estimator is

x̂_N^TWL := arg min_{x ∈ R^P} J_N^TWL(x),  N = 1, 2,...   (6)

where J_N^TWL(x) := (1/2) Σ_{n=1}^N β_{N,n} (y_n − h_n^T x)² + λ_N ||x||_1. In addition to windowing, note that λ_N is now allowed to vary with time. Neglecting constant terms, the cost function in (6) can be re-written as

J_N^TWL(x) = (1/2) x^T R_N x − x^T r_N + λ_N ||x||_1   (7)

where

r_N := Σ_{n=1}^N β_{N,n} y_n h_n,   R_N := Σ_{n=1}^N β_{N,n} h_n h_n^T.   (8)

Due to data windowing, r_N and R_N can be updated recursively as [cf. (w1)-(w3)]

(w1): r_N = r_{N−1} + y_N h_N,  R_N = R_{N−1} + h_N h_N^T   (9a)
(w2): r_N = β r_{N−1} + y_N h_N,  R_N = β R_{N−1} + h_N h_N^T   (9b)
(w3): r_N = r_{N−1} + y_N h_N − y_{N−M} h_{N−M},  R_N = R_{N−1} + h_N h_N^T − h_{N−M} h_{N−M}^T.   (9c)

Relative to the batch Lasso in (5), any of the TWL updates in (9) offers memory savings. Clearly, choices (w2) and (w3) also allow the signal of interest to vary slowly with time. With respect to RLS in (4), the TWL estimator inherits the properties brought by the ℓ1-norm, namely sparsity awareness and the ability to deal with under-determined systems (N < P). Summarizing, the attractive features of TWL are: i) reduced memory requirements with respect to batch Lasso; ii) improved performance relative to RLS when x_o is sparse and time-invariant; and iii) enhanced tracking capability when x_o is sparse and time-varying, relative to batch Lasso and RLS, for windows of size less than the dimension P of x_o.

Despite these attractive features, the main limitation of TWL is that a convex program has to be solved per time instant. While the RLS cost is differentiable, and thus amenable to closed-form minimization, J_N^TWL(x) is not. However, initializing the convex program at time N with the solution x̂_{N−1}^TWL at time N−1 provides a warm start-up, which speeds up convergence to the optimum x̂_N^TWL. For these reasons, TWL is a pseudo-real-time algorithm. Low-complexity real-time algorithms will be developed in Section IV. But for now, it is worth checking TWL for consistency.

A. (In)Consistency of the TWL estimator

Since the nonzero support of x_o is unknown, and sparse vector estimators are nonlinear functions of the data not expressible in closed form, performance analysis is distinct from, and far more challenging than, that of LS estimators. Consider for simplicity that x_o is time-invariant, for which (w1) is prudent to adopt, and suppose that the regressors and noise satisfy these regularity (ergodicity) conditions:

(r1) lim_{N→∞} (1/N) Σ_{n=1}^N h_n h_n^T = R_∞ with probability 1 (w.p. 1), with R_∞ positive definite; and
(r2) lim_{N→∞} (1/N) Σ_{n=1}^N v_n h_n = r_{vh} w.p. 1.

If the noise v_n and the regressors {h_n} are mixing, which is the case for most stationary processes with vanishing memory in practice, then (r1) and (r2) are readily met. Since v_n in (1) is zero mean and uncorrelated with h_n, the cross-covariance in (r2) vanishes. With r_{vh} = 0, it follows readily from (1) that r_∞ := lim_{N→∞} (1/N) Σ_{n=1}^N y_n h_n = R_∞ x_o w.p. 1. Upon dividing both sides of (7) by N and taking limits, (r1) and (r2) then imply that lim_{N→∞} (1/N) J_N^TWL(x) = (1/2) x^T R_∞ x − x^T r_∞ := J_∞(x) w.p. 1, provided λ_N is chosen to grow slower than N. In this case, as N → ∞ it holds that x̂_N^TWL = arg min_x J_N^TWL(x) → arg min_x J_∞(x) = R_∞^{-1} r_∞ = x_o w.p. 1. This proves the following result.

Proposition 1. For the model in (1) with (r1), (r2), and (w1) in effect, the TWL estimator is strongly consistent, provided that λ_N is chosen to satisfy lim_{N→∞} λ_N/N = 0.

At this point it is instructive to recall that, under the conditions of Proposition 1, the LS estimator x̂_N^LS = R_N^{-1} r_N also converges w.p. 1 to x_o = R_∞^{-1} r_∞, and is thus strongly consistent [16]. For this reason, to assess performance of TWL and differentiate it from that of LS, it is pertinent to consider N sufficiently large (but preferably finite), for which the standard sparsity-agnostic LS is unable to accurately estimate the zero entries of x_o. It is thus of interest to check whether TWL can estimate jointly the nonzero support and the nonzero entries of x_o consistently for N sufficiently large. To this end, suppose that the first P_1 entries of x_o are nonzero, i.e., S_{x_o} := {1,...,P_1}, and partition accordingly the R_∞ matrix as

R_∞ = [ R_∞^{11}  R_∞^{10} ; R_∞^{01}  R_∞^{00} ].

Following the definitions in [10], support consistency amounts to having

lim_{N→∞} Prob[ S_{x̂_N} = S_{x_o} ] = 1   (10)

and √N-estimation consistency requires convergence in distribution (→_d), that is,

√N ( x̂_N^{P_1} − x_o^{P_1} ) →_d N( 0_{P_1}, σ² (R_∞^{11})^{-1} )   (11)

where x̂_N^{P_1} denotes the P_1 × 1 vector obtained by extracting the first P_1 components of x̂_N. Properties (10) and (11) are referred to as oracle properties, because a sparsity-aware estimator possessing these properties is asymptotically as good as if the support S_{x_o} were known in advance [10]. Under (w1), the TWL corresponds to a sequential version of the batch Lasso estimator. Hence, asymptotic properties of the latter derived in [16], [29] carry over to the TWL estimator introduced here.

Lemma 1 (see also [29, Prop. 1]). For the model in (1) with (r1), (r2), and (w1) in effect, if lim_{N→∞} λ_N/√N = λ_0 ≥ 0, then lim_{N→∞} Prob[ S_{x̂_N^TWL} = S_{x_o} ] = c(λ_0) < 1, with c(λ_0) denoting an increasing function of λ_0.

In words, Lemma 1 asserts that if λ_N grows as √N, support consistency cannot be achieved. Since c(λ_0) increases with λ_0, the hope for the TWL to satisfy the oracle properties is left for cases wherein λ_N grows faster than √N. Unfortunately, the next result discourages this.

Lemma 2 (see also [29, Lemma 3]). For the model in (1) with (r1), (r2), and (w1) in effect, if lim_{N→∞} λ_N/√N = ∞ and lim_{N→∞} λ_N/N = 0, then lim_{N→∞} (N/λ_N)( x̂_N^TWL − x_o ) = C, where C is a non-random constant.

Lemma 2 states that if λ_N grows faster than √N but slower than N, the rate of convergence is N/λ_N, that is, slower than √N; hence, √N( x̂_N^TWL − x_o ) diverges. Combining Lemmas 1 and 2, the following negative result holds for the batch Lasso, and thus for TWL.

Proposition 2. For the model in (1) with (r1), (r2), and (w1) in effect, the TWL estimator cannot achieve the oracle properties for any choice of λ_N.

Before exploring alternatives to TWL that satisfy the oracle properties, one remark is in order.

Remark 1. If, instead of the convex ℓ1-norm, the LS cost is regularized with suitably chosen non-convex functions of x, it is possible to construct sparsity-aware estimators that asymptotically possess the oracle properties [10]. Of course, the price paid is inefficient optimization due to non-convexity. These considerations motivate searching for convex regularizing terms which result in pseudo-real-time Lasso estimators satisfying the oracle properties. Such a class is developed next, using the weighted ℓ1-norm regularization introduced in [29], [30] for the batch Lasso.

B. Time- and norm-weighted Lasso

Let u(·) denote the step function, {µ_N} a positive sequence dependent on the sample size, and a > 1 a constant tuning parameter. Based on these, define the weight function w_{µ_N}(·) : R_+ → [0,1] as

w_{µ_N}(x) := ( [aµ_N − x]_+ / ((a−1)µ_N) ) u(x − µ_N) + u(µ_N − x)   (12)

where [α]_+ := max(α, 0). Using w_{µ_N}, the novel time- and norm-weighted Lasso (TNWL) estimator weighs the ℓ1-norm with coefficients depending on the entries of the RLS estimate x̂_N^RLS; that is,

x̂_N^TNWL := arg min_{x ∈ R^P} [ (1/2) Σ_{n=1}^N β_{N,n} (y_n − h_n^T x)² + λ_N Σ_{p=1}^P w_{µ_N}( |x̂_N^RLS(p)| ) |x(p)| ],  N = 1, 2,...   (13)

Figure 1 shows the weight function w_{µ_N}(x) for µ_N = 0.2 and a = 3.7. Notice that, while λ_N in TWL weighs identically all summands |x(p)| in the ℓ1-norm, the TNWL estimator places higher weight on small entries, and lower weight on entries with large amplitudes. In fact, estimates of size less than µ_N are penalized as in TWL, while estimates between µ_N and aµ_N are penalized in a linearly decreasing manner. Finally, estimates larger than aµ_N are not penalized at all (cf. Fig. 1).

It is worth recalling at this point that, albeit sparsity-agnostic, the RLS estimator is √N-estimation consistent [16], that is,

√N ( x̂_N^RLS − x_o ) →_d N( 0, σ² R_∞^{-1} ).   (14)

Based on (14), it is possible to establish the following result.

Proposition 3 (see also [30, Theorem 4]). For the model in (1), with (r1), (r2), and (w1) in effect, if lim_{N→∞} λ_N/√N = ∞, lim_{N→∞} λ_N/N = 0, and µ_N = λ_N/N, the TNWL estimator satisfies the oracle properties (10) and (11).

Weighted ℓ1-norm regularization was introduced in [7], [29], and [30] using different weight functions to effect sparsity and satisfy the oracle properties of the batch weighted Lasso estimator. The weight function in (12) corresponds to the local linear approximation of the smoothly clipped absolute deviation (SCAD) regularizer introduced by [30].
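For concreteness, the short sketch below is a plain transcription of the weight function (12) and of how TNWL uses it; the values of µ_N and a match Fig. 1, while the RLS estimates fed to it are hypothetical and chosen only for illustration.

```python
import numpy as np

def w_mu(x, mu, a=3.7):
    """SCAD-type weight of (12): full weight below mu, linear decay on [mu, a*mu], zero above."""
    x = np.abs(np.asarray(x, dtype=float))
    return (np.maximum(a * mu - x, 0.0) / ((a - 1.0) * mu)) * (x > mu) + 1.0 * (x <= mu)

# TNWL weighs the p-th l1-term in (13) by w_mu(|x_rls[p]|): small RLS estimates are
# penalized fully, large ones (likely true nonzeros) are left unpenalized.
mu_N = 0.2                                   # illustrative value, matching Fig. 1
x_rls = np.array([0.05, 0.15, 0.5, 0.9])     # hypothetical RLS estimates
print(w_mu(x_rls, mu_N))                     # -> [1.  1.  0.4444...  0.]
```

In the adaptive setting, these weights are recomputed from the current RLS estimate at every time step, so the penalty adapts as evidence about the support accumulates.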
The difference here is its coupling with RLS to ensure consistency of the novel adaptive TNWL estimator. Next, the implications of Propositions 2 and 3 are demonstrated through simulated tests.

C. Numerical examples

Gaussian observations are generated according to (1) with a time-invariant x_o, P = 30, P_1 = 3, v_n ~ N(0, σ²), σ² = 10^{-1}, h_n ~ N(0_P, I_P), and infinite windowing as in (w1). The penalty scale is set to λ_N = 2σ√(2N log P) for the TWL (see also [29]), and λ_N = 2σ√(N log P) for TNWL, with µ_N = 1/√N and a = 3.7. The first three entries of x_o are chosen equal to unity, and all other entries are set to zero. Figure 2 depicts the mean-square error (MSE), E[ ||x̂_N − x_o||² ], across time for the TWL, TNWL, and RLS, along with what is termed genie-aided RLS (GA-RLS), which knows in advance the support and performs standard RLS to estimate the nonzero components. The convex optimization problem per time instant is solved using the SeDuMi package [22] interfaced with Yalmip [17]. Observe that while TWL outperforms RLS, it is outperformed by TNWL, whose performance approaches that of the GA-RLS benchmark. Indeed, the TNWL does achieve the oracle properties in the considered simulation setting.

Next, Gaussian observations are generated according to (1) with a time-varying x_o (henceforth denoted as x_n), and parameters P = 30, P_1 = 3, v_n ~ N(0, σ²), σ² = 10^{-1}, and h_n ~ N(0_P, I_P). A Gauss-Markov model is assumed for x_n; that is, x_n(p) = α x_{n−1}(p) + w_n(p) with x_0(p) ~ N(0,1), α = 0.99, and w_n(p) ~ N(0, 1−α²) for p = 1, 2, 3. Without loss of generality, (w2) is adopted with β = 0.9, λ_N = 2σ√(2 log P · Σ_{n=1}^N β^{2(N−n)}) for both TWL and TNWL, and µ_N = 1/Σ_{n=1}^N β^{N−n}.

Clearly, in a time-varying setting these estimators are not expected to achieve the consistency properties established when x_o remains time-invariant. Figure 3 depicts the squared error (SE) for a realization of the RLS, GA-RLS, TWL and TNWL. In the considered setting, TWL and TNWL perform similarly, and both outperform RLS while approaching the performance of the GA-RLS benchmark.

Next, Gaussian observations are generated according to (1) with a time-varying x_n, and parameters P = 30, P_1 = 3, v_n ~ N(0, σ²), σ² = 10^{-2}, and h_n ~ N(0_P, I_P). A Gauss-Markov model is assumed for x_n with α = 0.995, and (w3) windowing is adopted. For brevity, only the regularized RLS in (4) with constant γ_N is shown along with the TWL estimators. In Fig. 4, two window sizes of length M = 5 and M = 30 are simulated. Interestingly, while RLS with M = 30 outperforms RLS with M = 5, TWL with M = 5 outperforms TWL with M = 30, and achieves the overall best performance. In fact, TWL performs well even for small window sizes, M < P, when the signal of interest is sparse. Thus, TWL exhibits better tracking capability than RLS, which requires a longer window size and can therefore only track signals with slower variations.

IV. ADAPTIVE REAL-TIME LASSO

As mentioned earlier, the TWL and TNWL estimators are not suitable for real-time implementation. In this section, online algorithms are developed and analyzed. The vector iterates developed in the next subsection provide online solvers of (6) and (13), admit a closed-form solution per iteration, and are proved convergent to x_o when the unknown vector is time-invariant. For notational brevity, the algorithms are developed for the TWL estimator, but carry over to TNWL as well.

A. Online coordinate descent

One approach to finding the solution x̂_N^TWL in (6) is to run a cyclic coordinate descent (CCD) algorithm, which in its simplest form entails cyclic iterative minimization of J_N^TWL(x) in (7) with respect to one coordinate per iteration cycle. Let x_N^{(i)} denote the solution at time N and iteration i. The pth variable at the ith iteration is updated as

x_N^{(i)}(p) := arg min_x J_N^TWL( x_N^{(i)}(1),..., x_N^{(i)}(p−1), x, x_N^{(i−1)}(p+1),..., x_N^{(i−1)}(P) )   (15)

for p = 1,...,P. During the ith cycle each coordinate (here the pth) is optimized, while the pre-cursor coordinates (those with index less than p) are kept fixed to their values at the ith cycle, and the post-cursor coordinates (those with index greater than p) are kept fixed to their values at the (i−1)st cycle. Albeit convex, the cost J_N^TWL(x) is non-differentiable. Nonetheless, convergence of the CCD algorithm for Lasso-type problems follows readily using the results of [25]. In addition to affording effective initialization (with the all-zero vector), another attractive feature of CCD Lasso solvers is that each coordinate-wise minimizer is available in closed form. Recent comparative studies show that CCD exhibits computational complexity similar to (if not lower than) state-of-the-art batch Lasso solvers, and is numerically stable [13], [27].

The online coordinate descent (OCD) algorithm introduced next can be viewed as an adaptive counterpart of CCD Lasso, where a new datum is incorporated at each iteration; that is, the iteration index (i) in CCD is replaced in OCD by the time index N. The challenge arises because the cost function changes with time. The crux of OCD is to update only one variable per datum, in the spirit of, e.g., the partial least mean-squares (PLMS) algorithm [12].
Notwithstanding, PLMS is a sparsity-agnostic first-order algorithm, whereas OCD is sparsity-cognizant, it capitalizes on second-order statistics similar to RLS, and it is also provably convergent.

For notational convenience, express the time index as N = kP + p, where p ∈ {1,...,P} corresponds to the only entry of x to be updated at time N, and k = (N − p)/P indexes the number of cycles, that is, how many times the pth coordinate has been updated. Let x̃_N^OCD denote the solution of the OCD algorithm at time N, and set x̃_N^OCD(q) = x̃_{N−1}^OCD(q) for q ≠ p, which amounts to setting all but the pth coordinate at time N equal to those at time N−1, and selecting the pth one by minimizing J_N^TWL(x); that is,

x̃_N^OCD(p) := arg min_x J_N^TWL( x̃_{N−1}^OCD(1),..., x̃_{N−1}^OCD(p−1), x, x̃_{N−1}^OCD(p+1),..., x̃_{N−1}^OCD(P) ).   (16)

In the cyclic update (16), the pre-cursor coordinates {x̃_{N−1}^OCD(q), q < p} have been updated k+1 times, while the post-cursor entries {x̃_{N−1}^OCD(q), q > p} have been updated k times. After isolating from J_N^TWL(x) only the terms which depend on the pth coordinate that is currently optimized, recursion (16) can be rewritten as [cf. (7)]

x̃_N^OCD(p) = arg min_x [ (1/2) R_N(p,p) x² − r_{N,p} x + λ_N |x| ]   (17)

r_{N,p} := r_N(p) − Σ_{q ≠ p} R_N(p,q) x̃_{N−1}^OCD(q).   (18)

Being a scalar optimization problem, it is well known that the minimization in (17) accepts a closed-form solution, namely [13]

x̃_N^OCD(p) = sgn(r_{N,p}) [ |r_{N,p}| − λ_N ]_+ / R_N(p,p).   (19)

Equation (19) amounts to a soft-thresholding operation that sets to zero inactive entries, thus facilitating convergence to sparse iterates. The OCD-TWL scheme is tabulated as Algorithm 1.

Algorithm 1 OCD-TWL
Initialize with x̃_0^OCD(p) = 0, p = 1,...,P.
for k = 0, 1,... do
  for p = 1,...,P do
    S1. Acquire datum y_N and regressor h_N, N = kP + p.
    S2. Obtain r_N and R_N as in (8).
    S3. Set x̃_N^OCD(q) = x̃_{N−1}^OCD(q) for all q ≠ p.
    S4. Compute r_{N,p} via (18).
    S5. Update x̃_N^OCD(p) as in (19).
  end for
end for
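A compact sketch of Algorithm 1 under the infinite window (w1) follows. It is an illustrative rendition rather than the authors' implementation: the data model, the λ_N schedule, the run length, and the function name ocd_twl are assumptions made only for this example.

```python
import numpy as np

def ocd_twl(stream, P, lam_fn):
    """Online coordinate descent TWL (Algorithm 1) with infinite window (w1).

    stream : iterable of (y_n, h_n) pairs arriving sequentially in time
    P      : signal dimension
    lam_fn : callable N -> lambda_N (assumed regularization schedule)
    """
    R = np.zeros((P, P))
    r = np.zeros(P)
    x = np.zeros(P)                          # current estimate x_N^OCD
    for N, (y, h) in enumerate(stream, start=1):
        # S1-S2: incorporate the new datum into R_N and r_N, cf. (9a)
        R += np.outer(h, h)
        r += y * h
        # S3-S5: update only coordinate p (cycled with N) via soft thresholding (19)
        p = (N - 1) % P
        r_Np = r[p] - R[p] @ x + R[p, p] * x[p]
        if R[p, p] > 0:
            x[p] = np.sign(r_Np) * max(abs(r_Np) - lam_fn(N), 0.0) / R[p, p]
        yield x.copy()

# Example usage on synthetic data from model (1); parameter values are illustrative.
rng = np.random.default_rng(1)
P, sigma = 30, np.sqrt(1e-1)
x_o = np.zeros(P); x_o[:3] = 1.0
data = ((h @ x_o + sigma * rng.standard_normal(), h)
        for h in (rng.standard_normal(P) for _ in range(3000)))
lam = lambda N: 2 * sigma * np.sqrt(2 * N * np.log(P))
for est in ocd_twl(data, P, lam):
    pass
print("support found:", np.flatnonzero(np.abs(est) > 1e-3))
```

Replacing the (w1) accumulation by the (w2) or (w3) recursions of (9) yields the tracking variants discussed in Section V.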

Convergence of OCD-TWL is established in Appendix A, and the main result can be summarized as follows.

Proposition 4. For the model in (1) with (r1), (r2), and (w1) in effect, if lim_{N→∞} λ_N/N = 0, it holds w.p. 1 that lim_{N→∞} x̃_N^OCD = x_o.

In words, Proposition 4 asserts that the OCD-TWL estimator is strongly consistent.

B. Online selective coordinate descent

The OCD-TWL solver has low complexity, but may exhibit slow convergence since each variable is updated every P observations. But since P_1 ≪ P_0 due to sparsity, most of the time OCD-TWL re-sets to zero inactive entries of x_o. On the other hand, updating zero variables cannot be skipped a priori, since new nonzero entries may arise in time-varying scenarios. To address this dilemma, it is prudent to select which coordinate to update. A related selective approach has been pursued for the batch Lasso in [27], and is extended here to the novel OCD solver.

Let d_{e_p} J_N^TWL(x̃_{N−1}^OSCD) and d_{−e_p} J_N^TWL(x̃_{N−1}^OSCD) denote the forward and backward directional derivatives w.r.t. x(p) evaluated at x̃_{N−1}^OSCD, which denotes the online selective coordinate descent (OSCD) estimate at time N−1. Define also the vectors d_N^+, d_N^− ∈ R^P whose pth entries are d_{e_p} J_N^TWL(x̃_{N−1}^OSCD) and d_{−e_p} J_N^TWL(x̃_{N−1}^OSCD), respectively. It is not difficult to verify that (see also [27])

d_N^+ = R_N x̃_{N−1}^OSCD − r_N + λ_N s_N^+   (20)
d_N^− = r_N − R_N x̃_{N−1}^OSCD + λ_N s_N^−   (21)

with s_N^+, s_N^− ∈ R^P, where s_N^+(p) = 1 if x̃_{N−1}^OSCD(p) ≥ 0 and s_N^+(p) = −1 otherwise, while s_N^−(p) = 1 if x̃_{N−1}^OSCD(p) ≤ 0 and s_N^−(p) = −1 otherwise. After evaluating (20) and (21), the coordinate with the most negative directional derivative, either forward or backward, is updated. The OSCD-TWL scheme is summarized as Algorithm 2.

Algorithm 2 OSCD-TWL
Initialize x̃_0^OSCD(p) = 0, p = 1,...,P.
for N = 1, 2,... do
  S1. Acquire datum y_N and regressor h_N.
  S2. Evaluate d_N^+ and d_N^− as in (20) and (21).
  S3. Select p* = arg min_p { d_N^+(p), d_N^−(p) }_{p=1}^P.
  S4. Set x̃_N^OSCD(q) = x̃_{N−1}^OSCD(q) for all q ≠ p*.
  S5. Compute r_{N,p*} via (18).
  S6. Update x̃_N^OSCD(p*) as in (19).
end for

Remark 2. A subgradient-based, LMS-like RLS is developed in [2] for sparsity-aware online estimation. However, subgradient methods are first-order algorithms that possess slow convergence. For this reason, the OCD and OSCD alternatives developed here should be preferred.

C. Complexity issues

Recall that the RLS algorithm requires O(P²) algebraic operations per datum. On the other hand, the OCD-TWL Algorithm 1 requires r_{N,p}, whose computational burden is O(P) given R_N and r_N. As far as OSCD is concerned, the selection step requires evaluation of d_N^+ and d_N^−, whose computation entails O(P · P_{1,N−1}) algebraic operations, where P_{1,N−1} denotes the number of nonzero entries of x̃_{N−1}^OSCD. However, the overall computational burden of the OCD-TWL algorithm is dominated by the update of R_N, which requires O(P²) algebraic operations. In this general case, the OCD (OSCD) can be implemented cyclically to update each coordinate per datum, without affecting the overall complexity order, in order to speed up convergence. We summarize the online cyclic coordinate descent (OCCD) TWL as Algorithm 3.

Algorithm 3 OCCD-TWL
Initialize x̃_0^OCCD(p) = 0, p = 1,...,P.
for N = 1, 2,... do
  S1. Acquire datum y_N and regressor h_N.
  S2. Obtain r_N and R_N as in (8).
  for p = 1,...,P do
    S3. Evaluate r_{N,p} = r_N(p) − Σ_{q ≠ p} R_N(p,q) x̃_N^OCCD(q).
    S4. Update x̃_N^OCCD(p) via (19).
  end for
end for

An important simplification, which appears in problems such as system identification and beamforming, is that the regressors are sliding with time; that is, h_n := [h(n), h(n−1),..., h(n−P+1)]^T with h_0 = 0_P. In this case, the R_N updates and the RLS estimates incur complexity O(P) [19, p. 86], [28]. Likewise, OCD-TWL and OSCD-TWL in Algorithms 1 and 2 can also be implemented with complexity O(P).
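For illustration, a sketch of one OSCD-TWL step follows: it evaluates the directional derivatives (20)-(21), selects the steepest descent coordinate as in step S3 of Algorithm 2, and updates it by the soft threshold (19). It is a minimal rendition under assumed inputs (R, r, x, lam standing for R_N, r_N, the previous estimate, and λ_N), not the authors' code.

```python
import numpy as np

def oscd_step(R, r, x, lam):
    """One OSCD-TWL step: pick the coordinate with the most negative directional
    derivative, cf. (20)-(21), and update it by soft thresholding, cf. (19)."""
    g = R @ x - r                                # gradient of the smooth part of (7)
    s_plus = np.where(x >= 0, 1.0, -1.0)
    s_minus = np.where(x <= 0, 1.0, -1.0)
    d_plus = g + lam * s_plus                    # forward directional derivatives, (20)
    d_minus = -g + lam * s_minus                 # backward directional derivatives, (21)
    d = np.minimum(d_plus, d_minus)
    p = int(np.argmin(d))                        # steepest descent coordinate, S3
    if d[p] >= 0:                                # no descent direction: current x is optimal
        return x
    r_Np = r[p] - R[p] @ x + R[p, p] * x[p]
    x = x.copy()
    x[p] = np.sign(r_Np) * max(abs(r_Np) - lam, 0.0) / R[p, p]
    return x
```

Per datum, r_N and R_N would first be refreshed via (9), exactly as in Algorithm 1, before calling such a step.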
Same conclusions can be drawn for online implementations of the TNWL through OCD or OSCD. In a nutshell, the novel online algorithms entail complexity analogous to RLS.

V. SIMULATED TESTS

The online algorithms developed in Section IV are simulated here and compared with the TWL and TNWL algorithms of Section III, and also with the RLS, in both time-invariant and time-varying scenarios.

Gaussian observations are generated according to (1) with a time-invariant x_o, P = 30, P_1 = 3, v_n ~ N(0, σ²), σ² = 10^{-1}, h_n ~ N(0_P, I_P), and windowing as in (w1). The first three entries of x_o are chosen equal to unity, and all other entries are set to zero. Figure 5 depicts the MSE of the OCD-TWL, OCCD-TWL, OSCD-TWL, and TWL versus time. The scale is set to λ_N = 2σ√(2N log P). As expected, the OCD-TWL converges to the TWL, which requires the solution of a convex program per time instant. Similar results hold for the OCCD-TWL and OSCD-TWL algorithms, which also provide a means of enhancing the convergence speed.

Figure 6 depicts the MSE of the OCD-TNWL, OCCD-TNWL, OSCD-TNWL, and TNWL versus time. The scale is set to λ_N = 2σ√(N log P) with µ_N = 1/√N and a = 3.7. Also in this case, the online algorithms converge to their pseudo-real-time counterparts.

Next, Gaussian observations are generated according to (1) with a time-varying x_n, and parameters P = 30, P_1 = 3, v_n ~ N(0, σ²), σ² = 10^{-1}, and h_n ~ N(0_P, I_P).

A Gauss-Markov model is assumed for x_n, with entries generated according to x_n(p) = α x_{n−1}(p) + w_n(p), x_0(p) ~ N(0,1), α = 0.99, and w_n(p) ~ N(0, 1−α²) for p = 1, 2, 3; (w2) is adopted with β = 0.9 and scale λ_N = 2σ√(2 log P · Σ_{n=1}^N β^{2(N−n)}). Figure 7 shows a realization of the squared error (SE) for the OCD-TWL, OCCD-TWL, OSCD-TWL, TWL, and RLS. The OCD-TWL exhibits performance similar to that of the RLS. Indeed, updating only one coordinate per observation in time-varying settings weakens tracking capabilities [12]. However, the performance of OCCD-TWL and OSCD-TWL approaches that of TWL, and both outperform the RLS algorithm.

Successively, simulated tests are performed to assess performance when the support of x_n changes with time. The setting in this example is identical to that of Fig. 7, except that here the support of the sparse x_n also undergoes step changes. Specifically, at time N = 25 the third entry of x_n starts decreasing, and after N = 50 the same entry is set to zero. In addition, at N = 25 the fourth entry becomes nonzero. Figures 8 and 9 depict, respectively, the true variations of x_n(3) and x_n(4) across time, along with their estimates obtained using the RLS, the OCCD-TWL, and the OCCD-TNWL with µ_N = 1/Σ_{n=1}^N β^{N−n} and a = 3.7. Observe that the developed sparsity-aware algorithms can set to zero inactive entries, while RLS estimates are not sparse and yield a nonzero value even when the true entry is zero. Moreover, a few time instants after the support change points, the developed algorithms are able to track entries that become nonzero, and are further able to set to zero entries that disappear.

Finally, the novel algorithms are tested for identifying the sparse, finite impulse response of a discrete-time, linear system using input and noisy output data satisfying the input-output relationship

y_n = Σ_{p=0}^{P−1} h_p x_{n−p} + v_n = x_n^T h_o + v_n,  n = 1,...,N   (22)

where h_o := [h_0,...,h_{P−1}]^T collects the unknown impulse response coefficients, x_n := [x_n,...,x_{n−P+1}]^T denotes the given input data (the regressor vector in (1)), and y_n the output at time n. As the system order may be unknown, a large known upper bound P is selected. Since many entries of h_o may be zero or negligible, the impulse response is sparse. In addition, nonzero entries may exhibit slow variations, which gives rise to a time-varying impulse response h_n.

To assess performance of the introduced algorithms, a system with P = 28 and P_1 = 6 nonzero entries at unknown locations is simulated. The input sequence is assumed zero-mean, white, Gaussian, with unit variance, and v_n ~ N(0, σ²) with σ² = 10^{-2}. RLS, OSCD-TWL, and OSCD-TNWL are tested along with the GA-RLS for an exponential window (w2). Since the regressors here are shift-invariant, all algorithms incur a computational burden that scales linearly with P. The impulse response is generated according to a first-order Gauss-Markov process. The tuning parameters of the OSCD-TWL and OSCD-TNWL have been chosen as in Fig. 8. Figure 10 depicts the MSE (averaged over 100 realizations) across time. It is clear that both OSCD-TWL and OSCD-TNWL outperform the RLS. In particular, the gain of the OSCD-TNWL is more than one order of magnitude.

VI. CONCLUDING SUMMARY

Recursive algorithms were developed in this paper for estimation of (possibly time-varying) sparse signals based on observations that obey a linear regression model and become available sequentially in time. The novel TWL and TNWL algorithms can be viewed as ℓ1-norm regularized versions of the RLS. Simulations illustrated that TWL outperforms the sparsity-agnostic RLS scheme when estimating time-invariant and slowly-varying sparse signals. Moreover, the novel algorithms exhibit enhanced tracking capability with respect to RLS, especially for short observation windows. Performance analysis revealed that TWL estimates cannot simultaneously recover the signal support and maintain the convergence rate of RLS. This prompted the development of TNWL, which for proper selection of design parameters can achieve the oracle consistency properties for time-invariant sparse signals. However, TWL and TNWL require solving a convex problem per time step, and may be less desirable for real-time applications. To overcome this limitation, low-complexity sparsity-aware online schemes were also developed. The crux of these schemes is a novel optimization algorithm that implements the basic coordinate descent iteration online. Albeit simple, the resulting OCD-TWL (OCD-TNWL) algorithm was proved convergent when the sparse signal is time-invariant. At complexity comparable to OCD-TWL (OCD-TNWL), but with improved convergence speed, online selective variants choose the best coordinate to optimize and exhibit performance similar to the pseudo-real-time TWL (TNWL).

APPENDIX A
PROOF OF PROPOSITION 4

Define the vector χ_k := x̃_N^OCD for N = kP, that is, the one containing the iterates at the end of the kth cycle, when each variable has been updated k times. The proof that χ_k converges to x_o as k → ∞ will proceed in five stages. In the first one, Algorithm 1 is put in the form of a noisy vector-matrix difference equation. The second and third stages prove that the corresponding discrete-time dynamical system is exponentially stable, and that the sequence {χ_k}_{k=1}^∞ is bounded. In the fourth stage, convergence to a limit point χ_∞ is proved. The proof concludes by showing that χ_∞ = x_o.

A. Dynamical system

Let R̄_k denote the matrix with entries R̄_k(p,q) := R_{kP+p}(p,q)/(kP+p), and r̄_k the vector with entries r̄_k(p) := r_{kP+p}(p)/(kP+p). Conditions (r1) and (r2) guarantee that R̄_k → R_∞ and r̄_k → r_∞ as k → ∞ w.p. 1. Consider the decomposition R̄_k = D̄_k + L̄_k + Ū_k, where D̄_k is diagonal, and L̄_k (Ū_k) is strictly lower (upper) triangular. Observe that, in general, L̄_k ≠ Ū_k^T.

Dividing the cost function of the problem in (17) by kP + p yields

χ_{k+1}(p) = arg min_χ [ (1/2)( R_{kP+p}(p,p)/(kP+p) ) χ² − ( r_{kP+p,p}/(kP+p) ) χ + ( λ_{kP+p}/(kP+p) ) |χ| ].   (23)

The solution of this scalar minimization problem can be obtained in two steps. First, solve the differentiable linear-quadratic part of (23) using the auxiliary vector z_{k+1} to obtain

z_{k+1}(p) = arg min_z [ (1/2)( R_{kP+p}(p,p)/(kP+p) ) z² − ( r_{kP+p}(p)/(kP+p) − Σ_{q<p} ( R_{kP+p}(p,q)/(kP+p) ) χ_{k+1}(q) − Σ_{q>p} ( R_{kP+p}(p,q)/(kP+p) ) χ_k(q) ) z ]   (24)

where r_{kP+p,p} was expanded according to its definition in (18) and split into two sums: one over the coordinates already updated in cycle k+1 (q < p), and one over those updated in the kth cycle (q > p). The second step to solve (23) is to pass z_{k+1}(p) through the soft-threshold operator

χ_{k+1}(p) = sgn( z_{k+1}(p) ) [ |z_{k+1}(p)| − λ_{kP+p}/(kP+p) ]_+.   (25)

Using the decomposition of R̄_k, (24) can be rewritten as

z_{k+1}(p) = arg min_z [ (1/2) D̄_k(p,p) z² − ( r̄_k(p) − Σ_{q<p} L̄_k(p,q) χ_{k+1}(q) − Σ_{q>p} Ū_k(p,q) χ_k(q) ) z ]

whose solution is obtained (after equating the derivative to zero) as

D̄_k(p,p) z_{k+1}(p) = r̄_k(p) − Σ_{q<p} L̄_k(p,q) χ_{k+1}(q) − Σ_{q>p} Ū_k(p,q) χ_k(q).   (26)

Concatenating the latter for p = 1,...,P yields the matrix-vector difference equation

D̄_k z_{k+1} = r̄_k − L̄_k χ_{k+1} − Ū_k χ_k.   (27)

The soft-thresholding operation in (25) can be accounted for by defining the error vector ǫ_k := χ_{k+1} − z_{k+1}, in which case (27) can be re-written as

D̄_k ( χ_{k+1} − ǫ_k ) = r̄_k − L̄_k χ_{k+1} − Ū_k χ_k.   (28)

Assuming that there exists a k̄ such that D̄_k + L̄_k is invertible for each k > k̄, (28) can be written as

χ_{k+1} = Ḡ_k χ_k + ( D̄_k + L̄_k )^{-1} r̄_k + ( D̄_k + L̄_k )^{-1} D̄_k ǫ_k   (29)

with Ḡ_k := −( D̄_k + L̄_k )^{-1} Ū_k. The key point to be used subsequently is that (25) guarantees that the entries of ǫ_k are bounded by a vanishing sequence. Specifically, it holds that

|ǫ_k(p)| ≤ λ_{kP+p}/(kP+p),  for p = 1,...,P   (30)

since the input-output variables of the soft-threshold operator χ = sgn(z)[|z| − λ]_+ obey |χ − z| ≤ λ.

B. Exponential stability

Let Ḡ(l : k) := Π_{i=l}^k Ḡ_i denote the product of the transition matrices Ḡ_i. The goal of this stage is to prove that ||Ḡ(l : k)|| ≤ c ρ^{k−l+1}, with ρ < 1. The convergence of R̄_k to R_∞ implies convergence of Ḡ_k to G_∞ := −( D_∞ + L_∞ )^{-1} U_∞, where D_∞, L_∞ and U_∞ are the diagonal, strictly lower triangular, and strictly upper triangular parts of R_∞, respectively. Since R_∞ is positive definite, the spectral radius of G_∞ is strictly less than one, i.e., ϱ(G_∞) < 1 [15, p. 52]. Furthermore, for every δ > 0 there exists a constant c(δ), not depending on k, such that ||G_∞^k|| < c(δ)[ ϱ(G_∞) + δ ]^k [15, p. 336]. Then, by selecting δ = (1 − ϱ(G_∞))/2 and defining ρ_1 := (1 + ϱ(G_∞))/2, it holds that

||G_∞^k|| < c ρ_1^k,  with ρ_1 < 1.   (31)

Upon defining G̃_k := Ḡ_k − G_∞, the following recursion is obtained

Ḡ(1 : k) = Ḡ_k Ḡ(1 : k−1) = G_∞ Ḡ(1 : k−1) + G̃_k Ḡ(1 : k−1) = G_∞^k + Σ_{i=1}^k G_∞^{k−i} G̃_i Ḡ(1 : i−1),  Ḡ(1 : 0) := I.

Using (31), the latter can be bounded as

||Ḡ(1 : k)|| ≤ c ρ_1^k + c Σ_{i=1}^k ρ_1^{k−i} ||G̃_i|| ||Ḡ(1 : i−1)||

which, after multiplying both sides by ρ_1^{−k}, yields

ρ_1^{−k} ||Ḡ(1 : k)|| ≤ c + Σ_{i=1}^k ( c ρ_1^{−1} ||G̃_i|| ) ρ_1^{−(i−1)} ||Ḡ(1 : i−1)||   (32)

and allows one to apply the discrete Bellman-Gronwall lemma (see e.g., [21, p. 35]).

Lemma 5 (Bellman-Gronwall). If c, ξ_k, h_k ≥ 0 for all k satisfy the recursive inequality

ξ_k ≤ c + Σ_{i=1}^{k} h_i ξ_{i−1}   (33)

then ξ_k obeys the non-recursive inequality

ξ_k ≤ c Π_{i=1}^{k} ( 1 + h_i ).   (34)

For ξ_k = ρ_1^{−k} ||Ḡ(1 : k)|| and h_i = c ρ_1^{−1} ||G̃_i||, (32) takes the form of (33), so that (34) holds and (after multiplying both sides by ρ_1^k) results in

||Ḡ(1 : k)|| ≤ ρ_1^k c Π_{i=1}^{k} ( 1 + c ρ_1^{−1} ||G̃_i|| ) = c Π_{i=1}^{k} ( ρ_1 + c ||G̃_i|| ).   (35)

Raising both sides of (35) to the power 1/k and applying the geometric-arithmetic mean inequality, it follows that

||Ḡ(1 : k)||^{1/k} ≤ c^{1/k} (1/k) Σ_{i=1}^{k} ( ρ_1 + c ||G̃_i|| )

which is readily rewritten as

||Ḡ(1 : k)||^{1/k} ≤ c^{1/k} ( ρ_1 + (c/k) Σ_{i=1}^{k} ||G̃_i|| ).

Since Ḡ_k → G_∞, and thus ||G̃_k|| → 0, as k → ∞ w.p. 1, for every δ > 0 there exists an integer k_0 such that if k ≥ k_0, then (1/k) Σ_{i=1}^{k} ||G̃_i|| ≤ δ w.p. 1. Thus, if δ is selected as (1 − ρ_1)/(2c), and ρ as (1 + ρ_1)/2, the following bound is obtained

||Ḡ(1 : k)|| ≤ c ρ^k,  ρ < 1,  ∀ k ≥ k_0,  w.p. 1.

It is clear, by inspection, that the proof so far carries over even if the product of transition matrices starts at l > 1; that is,

||Ḡ(l : k)|| ≤ c ρ^{k−l+1},  ρ < 1,  ∀ k ≥ k_0,  w.p. 1.   (36)

Certainly, c, ρ, and k_0 do not depend on l. However, k_0 does depend on the realization of the random sequence Ḡ_k, and its existence and finiteness are guaranteed w.p. 1.

C. Boundedness

Define v_k := ( D̄_k + L̄_k )^{-1} ( r̄_k + D̄_k ǫ_k ), and rewrite (29) as χ_{k+1} = Ḡ_k χ_k + v_k. Using back substitution, χ_{k+1} can then be expressed as

χ_{k+1} = Ḡ(k_0 : k) χ_{k_0} + Σ_{i=1}^{k−k_0} Ḡ(k−i+1 : k) v_{k−i} + v_k.

Since D̄_k, L̄_k, r̄_k, and ǫ_k converge, the random sequence v_k converges too w.p. 1; hence, it can be stochastically bounded by a random variable v, that is, ||v_k|| < v, ∀k, w.p. 1. This, combined with the exponential stability ensured by (36), guarantees that the realizations of the random sequence χ_k are (stochastically) bounded; thus

||χ_{k+1}|| ≤ c ρ^{k−k_0+1} ||χ_{k_0}|| + c Σ_{i=1}^{k−k_0} ρ^i v + v ≤ c ||χ_{k_0}|| + v ( cρ/(1−ρ) + 1 ),  w.p. 1.   (37)

D. Convergence

Define the error D̃_k := D̄_k − D_∞, and similarly L̃_k := L̄_k − L_∞, Ũ_k := Ū_k − U_∞, and r̃_k := r̄_k − r_∞. Using these new variables, (28) can be rewritten in error form as

( D_∞ + D̃_k )( χ_{k+1} − ǫ_k ) = ( r_∞ + r̃_k ) − ( L_∞ + L̃_k ) χ_{k+1} − ( U_∞ + Ũ_k ) χ_k   (38)

and, after regrouping terms, as

χ_{k+1} = −( D_∞ + L_∞ )^{-1} U_∞ χ_k + ( D_∞ + L_∞ )^{-1} ( r_∞ + r̃_k − ( D̃_k + L̃_k ) χ_{k+1} + ( D_∞ + D̃_k ) ǫ_k − Ũ_k χ_k ).   (39)

Equation (39) describes an exponentially stable linear time-invariant system with transition matrix −( D_∞ + L_∞ )^{-1} U_∞, and input u_k := ( D_∞ + L_∞ )^{-1} [ r_∞ + r̃_k − ( D̃_k + L̃_k ) χ_{k+1} + ( D_∞ + D̃_k ) ǫ_k − Ũ_k χ_k ]. The input can be divided into its limiting point u_∞ := ( D_∞ + L_∞ )^{-1} r_∞, and the error ũ_k := ( D_∞ + L_∞ )^{-1} [ r̃_k − ( D̃_k + L̃_k ) χ_{k+1} + ( D_∞ + D̃_k ) ǫ_k − Ũ_k χ_k ]. As k → ∞, the vector ũ_k goes to zero almost surely, because the sequence χ_k is bounded and the errors r̃_k, D̃_k, L̃_k, and Ũ_k, as well as ǫ_k, all go to zero w.p. 1.

With this notation, and recalling the definition G_∞ := −( D_∞ + L_∞ )^{-1} U_∞, (39) is rewritten as χ_{k+1} = G_∞ χ_k + u_∞ + ũ_k, and back-substituting again the new expression for χ_{k+1} yields

χ_{k+1} = G_∞^k χ_1 + Σ_{i=1}^{k} G_∞^{k−i} u_∞ + Σ_{i=1}^{k} G_∞^{k−i} ũ_i.   (40)

Convergence of this recursion will be established by showing that the first and third terms on the right-hand side vanish as k → ∞, while the surviving one corresponds to a stable geometric series. Given that there exist c > 0 and ρ_1 < 1 such that ||G_∞^n|| ≤ c ρ_1^n for all n, convergence of the first term to zero follows readily from (31). The third term represents the limiting output value of a multiple-input multiple-output stable linear time-invariant system with vanishing input; that is, lim_{i→∞} ũ_i = 0. As ||G_∞^k|| ≤ c ρ_1^k, (31) implies that it is possible to bound the sum under consideration as

|| Σ_{i=1}^{k} G_∞^{k−i} ũ_i || ≤ c Σ_{i=1}^{k} ρ_1^{k−i} ||ũ_i||.   (41)

Since lim_{i→∞} ũ_i = 0, it holds by the definition of the limit that for any ǫ > 0 there exists an ī so that ||ũ_i|| ≤ ǫ, ∀ i ≥ ī. Using the latter along with (41), it follows that for k ≥ ī

|| Σ_{i=1}^{k} G_∞^{k−i} ũ_i || ≤ c Σ_{i=1}^{ī−1} ρ_1^{k−i} ||ũ_i|| + c ǫ Σ_{i=ī}^{k} ρ_1^{k−i} = c ρ_1^k Σ_{i=1}^{ī−1} ρ_1^{−i} ||ũ_i|| + c ǫ Σ_{i=ī}^{k} ρ_1^{k−i}.   (42)

Because Σ_{i=1}^{ī−1} ρ_1^{−i} ||ũ_i|| does not depend on k, the first summand in (42) goes to zero as k → ∞; hence,

lim_{k→∞} || Σ_{i=1}^{k} G_∞^{k−i} ũ_i || ≤ c ǫ / (1 − ρ_1).

The last inequality holds ∀ ǫ > 0; thus,

lim_{k→∞} Σ_{i=1}^{k} G_∞^{k−i} ũ_i = 0

which establishes convergence to zero of the third sum on the right-hand side of (40).

E. Limit point

Once convergence is established, it is possible to take the limit as k → ∞ in (38) to obtain

D_∞ ( χ_∞ − ǫ_∞ ) = r_∞ − L_∞ χ_∞ − U_∞ χ_∞,  w.p. 1.   (43)

Recalling that ǫ_∞ := lim_{k→∞} ǫ_k = 0, (43) reduces to

( D_∞ + L_∞ + U_∞ ) χ_∞ = r_∞,  w.p. 1   (44)

and since D_∞ + L_∞ + U_∞ = R_∞, it holds that

χ_∞ = R_∞^{-1} r_∞ = x_o,  w.p. 1   (45)

which concludes the proof.

ACKNOWLEDGMENTS

The work in this paper was supported by the NSF grants CCF and CON, and also through collaborative participation in the Communications and Networks Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Automatic Control, 1974.
[2] D. Angelosante and G. B. Giannakis, "RLS-weighted Lasso for adaptive estimation of sparse signals," in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process., Taipei, Taiwan, Apr. 2009.
[3] J.-A. Bazerque and G. B. Giannakis, "Distributed spectrum sensing for cognitive radios by exploiting sparsity," in Proc. of 42nd Asilomar Conf., Pacific Grove, CA, Oct. 2008.
[4] E. Candès, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, 2006.
[5] E. Candès and T. Tao, "Near optimal signal recovery from random projections: Universal encoding strategies?," IEEE Trans. on Information Theory, vol. 52, no. 12, 2006.
[6] E. Candès and T. Tao, "The Dantzig selector: Statistical estimation when p is much larger than n," Annals of Statistics, vol. 35, no. 6, 2007.
[7] E. Candès, M. Wakin, and S. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," J. Fourier Anal. Appl., vol. 14, 2008.
[8] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. on Scientific Computing, vol. 20, no. 1, pp. 33-61, 1998.
[9] S. F. Cotter and B. D. Rao, "Sparse channel estimation via matching pursuit with application to equalization," IEEE Trans. on Communications, vol. 50, no. 3, March 2002.
[10] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Amer. Statist. Assoc., vol. 96, 2001.
[11] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," Annals of Statistics, vol. 22, no. 4, 1994.
[12] M. Godavarti and A. O. Hero III, "Partial update LMS algorithms," IEEE Trans. on Signal Processing, vol. 53, no. 7, July 2005.
[13] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, "Pathwise coordinate optimization," Annals of Applied Statistics, vol. 1, Dec. 2007.
[14] Y. Gu, Y. Chen, and A. O. Hero, "Sparse LMS for system identification," in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process., Taipei, Taiwan, Apr. 2009.
[15] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, 1996.
[16] K. Knight and W. J. Fu, "Asymptotics for Lasso-type estimators," Annals of Statistics, vol. 28, 2000.
[17] J. Löfberg, "Yalmip: Software for solving convex (and nonconvex) optimization problems," in Proc. Amer. Contr. Conf., Minneapolis, MN, June 2006.
[18] D. M. Malioutov, S. Sanghavi, and A. S. Willsky, "Compressed sensing with sequential observations," in Proc. of Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), Las Vegas, NV, April 2008.
[19] A. H. Sayed, Fundamentals of Adaptive Filtering, John Wiley & Sons, 2003.
[20] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, 1978.
[21] V. Solo and X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance, Prentice-Hall, Upper Saddle River, NJ, 1995.
[22] J. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optim. Meth. Softw., vol. 11-12, 1999.
[23] G. Taubock and F. Hlawatsch, "A compressed sensing technique for OFDM channel estimation in mobile environments: Exploiting channel sparsity for reducing pilots," in Proc. of Int. Conf. Acoust., Speech, Signal Process., Las Vegas, NV, Apr. 2008.
[24] R. Tibshirani, "Regression shrinkage and selection via the Lasso," J. Royal Statist. Soc., vol. 58, no. 1, 1996.
[25] P. Tseng, "Convergence of a block coordinate descent method for nondifferentiable minimization," J. Optim. Theory Appl., vol. 109, June 2001.
[26] N. Vaswani, "Kalman filtered compressed sensing," in Proc. of Intl. Conf. Image Proc., San Diego, CA, Oct. 2008.
[27] T. T. Wu and K. Lange, "Coordinate descent algorithms for Lasso penalized regression," Annals of Applied Statistics, vol. 2, March 2008.
[28] G. P. White, Y. V. Zakharov, and J. Liu, "Low-complexity RLS algorithms using dichotomous coordinate descent iterations," IEEE Trans. Sign. Proc., vol. 56, July 2008.
[29] H. Zou, "The adaptive Lasso and its oracle properties," J. of the American Statistical Association, vol. 101, no. 476, 2006.
[30] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," Annals of Statistics, vol. 36, no. 4, 2008.

Fig. 1. Weight functions for the TWL and TNWL estimators (µ_N = 0.2, a = 3.7).
Fig. 2. MSE comparisons of pseudo-real-time estimators (time-invariant x_o).
Fig. 3. Squared error comparisons of pseudo-real-time estimators (time-varying x_n with exponentially decreasing window).
Fig. 4. Squared error comparisons of pseudo-real-time estimators (time-varying x_n with finite window).
Fig. 5. MSE comparisons of online estimators (time-invariant x_o).
Fig. 6. MSE comparisons of online weighted-norm estimators (time-invariant x_o).


More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Compressed Sensing and Robust Recovery of Low Rank Matrices

Compressed Sensing and Robust Recovery of Low Rank Matrices Compressed Sensing and Robust Recovery of Low Rank Matrices M. Fazel, E. Candès, B. Recht, P. Parrilo Electrical Engineering, University of Washington Applied and Computational Mathematics Dept., Caltech

More information

DNNs for Sparse Coding and Dictionary Learning

DNNs for Sparse Coding and Dictionary Learning DNNs for Sparse Coding and Dictionary Learning Subhadip Mukherjee, Debabrata Mahapatra, and Chandra Sekhar Seelamantula Department of Electrical Engineering, Indian Institute of Science, Bangalore 5612,

More information

SIGNALS with sparse representations can be recovered

SIGNALS with sparse representations can be recovered IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 9, SEPTEMBER 2015 1497 Cramér Rao Bound for Sparse Signals Fitting the Low-Rank Model with Small Number of Parameters Mahdi Shaghaghi, Student Member, IEEE,

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery

Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad

More information

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases 2558 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 9, SEPTEMBER 2002 A Generalized Uncertainty Principle Sparse Representation in Pairs of Bases Michael Elad Alfred M Bruckstein Abstract An elementary

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

/16/$ IEEE 1728

/16/$ IEEE 1728 Extension of the Semi-Algebraic Framework for Approximate CP Decompositions via Simultaneous Matrix Diagonalization to the Efficient Calculation of Coupled CP Decompositions Kristina Naskovska and Martin

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing

Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar

More information

ADAPTIVE FILTER THEORY

ADAPTIVE FILTER THEORY ADAPTIVE FILTER THEORY Fifth Edition Simon Haykin Communications Research Laboratory McMaster University Hamilton, Ontario, Canada International Edition contributions by Telagarapu Prabhakar Department

More information

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis 84 R. LANDQVIST, A. MOHAMMED, COMPARATIVE PERFORMANCE ANALYSIS OF THR ALGORITHMS Comparative Performance Analysis of Three Algorithms for Principal Component Analysis Ronnie LANDQVIST, Abbas MOHAMMED Dept.

More information

Compressed Sensing and Sparse Recovery

Compressed Sensing and Sparse Recovery ELE 538B: Sparsity, Structure and Inference Compressed Sensing and Sparse Recovery Yuxin Chen Princeton University, Spring 217 Outline Restricted isometry property (RIP) A RIPless theory Compressed sensing

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE 5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER 2009 Uncertainty Relations for Shift-Invariant Analog Signals Yonina C. Eldar, Senior Member, IEEE Abstract The past several years

More information

Bias-free Sparse Regression with Guaranteed Consistency

Bias-free Sparse Regression with Guaranteed Consistency Bias-free Sparse Regression with Guaranteed Consistency Wotao Yin (UCLA Math) joint with: Stanley Osher, Ming Yan (UCLA) Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U) UC Riverside, STATS Department March

More information

Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego

Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego 1 Outline Course Outline Motivation for Course Sparse Signal Recovery Problem Applications Computational

More information

Adaptive MMSE Equalizer with Optimum Tap-length and Decision Delay

Adaptive MMSE Equalizer with Optimum Tap-length and Decision Delay Adaptive MMSE Equalizer with Optimum Tap-length and Decision Delay Yu Gong, Xia Hong and Khalid F. Abu-Salim School of Systems Engineering The University of Reading, Reading RG6 6AY, UK E-mail: {y.gong,x.hong,k.f.abusalem}@reading.ac.uk

More information

Estimation Error Bounds for Frame Denoising

Estimation Error Bounds for Frame Denoising Estimation Error Bounds for Frame Denoising Alyson K. Fletcher and Kannan Ramchandran {alyson,kannanr}@eecs.berkeley.edu Berkeley Audio-Visual Signal Processing and Communication Systems group Department

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

LTI Systems, Additive Noise, and Order Estimation

LTI Systems, Additive Noise, and Order Estimation LTI Systems, Additive oise, and Order Estimation Soosan Beheshti, Munther A. Dahleh Laboratory for Information and Decision Systems Department of Electrical Engineering and Computer Science Massachusetts

More information

LEAST Mean Squares Algorithm (LMS), introduced by

LEAST Mean Squares Algorithm (LMS), introduced by 1 Hard Threshold Least Mean Squares Algorithm Lampros Flokas and Petros Maragos arxiv:1608.0118v1 [cs.sy] 3 Aug 016 Abstract This work presents a new variation of the commonly used Least Mean Squares Algorithm

More information

Likelihood Bounds for Constrained Estimation with Uncertainty

Likelihood Bounds for Constrained Estimation with Uncertainty Proceedings of the 44th IEEE Conference on Decision and Control, and the European Control Conference 5 Seville, Spain, December -5, 5 WeC4. Likelihood Bounds for Constrained Estimation with Uncertainty

More information

EFFECTS OF ILL-CONDITIONED DATA ON LEAST SQUARES ADAPTIVE FILTERS. Gary A. Ybarra and S.T. Alexander

EFFECTS OF ILL-CONDITIONED DATA ON LEAST SQUARES ADAPTIVE FILTERS. Gary A. Ybarra and S.T. Alexander EFFECTS OF ILL-CONDITIONED DATA ON LEAST SQUARES ADAPTIVE FILTERS Gary A. Ybarra and S.T. Alexander Center for Communications and Signal Processing Electrical and Computer Engineering Department North

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

NONLINEAR systems with memory appear frequently in

NONLINEAR systems with memory appear frequently in IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 59, NO 12, DECEMBER 2011 5907 Sparse Volterra and Polynomial Regression Models: Recoverability and Estimation Vassilis Kekatos, Member, IEEE, and Georgios B

More information

Sparse inverse covariance estimation with the lasso

Sparse inverse covariance estimation with the lasso Sparse inverse covariance estimation with the lasso Jerome Friedman Trevor Hastie and Robert Tibshirani November 8, 2007 Abstract We consider the problem of estimating sparse graphs by a lasso penalty

More information

Alternating Optimization with Shrinkage

Alternating Optimization with Shrinkage Sparsity-Aware Adaptive Algorithms Based on 1 Alternating Optimization with Shrinkage Rodrigo C. de Lamare and Raimundo Sampaio-Neto arxiv:1401.0463v1 [cs.sy] 2 Jan 2014 Abstract This letter proposes a

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

ESTIMATOR STABILITY ANALYSIS IN SLAM. Teresa Vidal-Calleja, Juan Andrade-Cetto, Alberto Sanfeliu

ESTIMATOR STABILITY ANALYSIS IN SLAM. Teresa Vidal-Calleja, Juan Andrade-Cetto, Alberto Sanfeliu ESTIMATOR STABILITY ANALYSIS IN SLAM Teresa Vidal-Calleja, Juan Andrade-Cetto, Alberto Sanfeliu Institut de Robtica i Informtica Industrial, UPC-CSIC Llorens Artigas 4-6, Barcelona, 88 Spain {tvidal, cetto,

More information

A New Method for Varying Adaptive Bandwidth Selection

A New Method for Varying Adaptive Bandwidth Selection IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. 47, O. 9, SEPTEMBER 1999 2567 TABLE I SQUARE ROOT MEA SQUARED ERRORS (SRMSE) OF ESTIMATIO USIG THE LPA AD VARIOUS WAVELET METHODS A ew Method for Varying Adaptive

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1 GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS Mitsuru Kawamoto,2 and Yuiro Inouye. Dept. of Electronic and Control Systems Engineering, Shimane University,

More information

Auxiliary signal design for failure detection in uncertain systems

Auxiliary signal design for failure detection in uncertain systems Auxiliary signal design for failure detection in uncertain systems R. Nikoukhah, S. L. Campbell and F. Delebecque Abstract An auxiliary signal is an input signal that enhances the identifiability of a

More information

Minimax MMSE Estimator for Sparse System

Minimax MMSE Estimator for Sparse System Proceedings of the World Congress on Engineering and Computer Science 22 Vol I WCE 22, October 24-26, 22, San Francisco, USA Minimax MMSE Estimator for Sparse System Hongqing Liu, Mandar Chitre Abstract

More information

Denoising of NIRS Measured Biomedical Signals

Denoising of NIRS Measured Biomedical Signals Denoising of NIRS Measured Biomedical Signals Y V Rami reddy 1, Dr.D.VishnuVardhan 2 1 M. Tech, Dept of ECE, JNTUA College of Engineering, Pulivendula, A.P, India 2 Assistant Professor, Dept of ECE, JNTUA

More information

CCA BASED ALGORITHMS FOR BLIND EQUALIZATION OF FIR MIMO SYSTEMS

CCA BASED ALGORITHMS FOR BLIND EQUALIZATION OF FIR MIMO SYSTEMS CCA BASED ALGORITHMS FOR BLID EQUALIZATIO OF FIR MIMO SYSTEMS Javier Vía and Ignacio Santamaría Dept of Communications Engineering University of Cantabria 395 Santander, Cantabria, Spain E-mail: {jvia,nacho}@gtasdicomunicanes

More information

Recursive l 1, Group lasso

Recursive l 1, Group lasso Recursive l, Group lasso Yilun Chen, Student Member, IEEE, Alfred O. Hero, III, Fellow, IEEE Abstract We introduce a recursive adaptive group lasso algorithm for real-time penalized least squares prediction

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Linear-Quadratic Optimal Control: Full-State Feedback

Linear-Quadratic Optimal Control: Full-State Feedback Chapter 4 Linear-Quadratic Optimal Control: Full-State Feedback 1 Linear quadratic optimization is a basic method for designing controllers for linear (and often nonlinear) dynamical systems and is actually

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

arxiv: v1 [math.st] 1 Dec 2014

arxiv: v1 [math.st] 1 Dec 2014 HOW TO MONITOR AND MITIGATE STAIR-CASING IN L TREND FILTERING Cristian R. Rojas and Bo Wahlberg Department of Automatic Control and ACCESS Linnaeus Centre School of Electrical Engineering, KTH Royal Institute

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE 2010 2935 Variance-Component Based Sparse Signal Reconstruction and Model Selection Kun Qiu, Student Member, IEEE, and Aleksandar Dogandzic,

More information

Sparse orthogonal factor analysis

Sparse orthogonal factor analysis Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In

More information

Title without the persistently exciting c. works must be obtained from the IEE

Title without the persistently exciting c.   works must be obtained from the IEE Title Exact convergence analysis of adapt without the persistently exciting c Author(s) Sakai, H; Yang, JM; Oka, T Citation IEEE TRANSACTIONS ON SIGNAL 55(5): 2077-2083 PROCESS Issue Date 2007-05 URL http://hdl.handle.net/2433/50544

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Sparse PCA with applications in finance

Sparse PCA with applications in finance Sparse PCA with applications in finance A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon 1 Introduction

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

ACCORDING to Shannon s sampling theorem, an analog

ACCORDING to Shannon s sampling theorem, an analog 554 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 59, NO 2, FEBRUARY 2011 Segmented Compressed Sampling for Analog-to-Information Conversion: Method and Performance Analysis Omid Taheri, Student Member,

More information

Local Strong Convexity of Maximum-Likelihood TDOA-Based Source Localization and Its Algorithmic Implications

Local Strong Convexity of Maximum-Likelihood TDOA-Based Source Localization and Its Algorithmic Implications Local Strong Convexity of Maximum-Likelihood TDOA-Based Source Localization and Its Algorithmic Implications Huikang Liu, Yuen-Man Pun, and Anthony Man-Cho So Dept of Syst Eng & Eng Manag, The Chinese

More information

Noisy Signal Recovery via Iterative Reweighted L1-Minimization

Noisy Signal Recovery via Iterative Reweighted L1-Minimization Noisy Signal Recovery via Iterative Reweighted L1-Minimization Deanna Needell UC Davis / Stanford University Asilomar SSC, November 2009 Problem Background Setup 1 Suppose x is an unknown signal in R d.

More information

COMPARATIVE ANALYSIS OF ORTHOGONAL MATCHING PURSUIT AND LEAST ANGLE REGRESSION

COMPARATIVE ANALYSIS OF ORTHOGONAL MATCHING PURSUIT AND LEAST ANGLE REGRESSION COMPARATIVE ANALYSIS OF ORTHOGONAL MATCHING PURSUIT AND LEAST ANGLE REGRESSION By Mazin Abdulrasool Hameed A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for

More information

Exact Topology Identification of Large-Scale Interconnected Dynamical Systems from Compressive Observations

Exact Topology Identification of Large-Scale Interconnected Dynamical Systems from Compressive Observations Exact Topology Identification of arge-scale Interconnected Dynamical Systems from Compressive Observations Borhan M Sanandaji, Tyrone Vincent, and Michael B Wakin Abstract In this paper, we consider the

More information

ANYTIME OPTIMAL DISTRIBUTED KALMAN FILTERING AND SMOOTHING. University of Minnesota, 200 Union Str. SE, Minneapolis, MN 55455, USA

ANYTIME OPTIMAL DISTRIBUTED KALMAN FILTERING AND SMOOTHING. University of Minnesota, 200 Union Str. SE, Minneapolis, MN 55455, USA ANYTIME OPTIMAL DISTRIBUTED KALMAN FILTERING AND SMOOTHING Ioannis D. Schizas, Georgios B. Giannakis, Stergios I. Roumeliotis and Alejandro Ribeiro University of Minnesota, 2 Union Str. SE, Minneapolis,

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

An iterative hard thresholding estimator for low rank matrix recovery

An iterative hard thresholding estimator for low rank matrix recovery An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical

More information

GREEDY SIGNAL RECOVERY REVIEW

GREEDY SIGNAL RECOVERY REVIEW GREEDY SIGNAL RECOVERY REVIEW DEANNA NEEDELL, JOEL A. TROPP, ROMAN VERSHYNIN Abstract. The two major approaches to sparse recovery are L 1-minimization and greedy methods. Recently, Needell and Vershynin

More information

Modeling and Analysis of Dynamic Systems

Modeling and Analysis of Dynamic Systems Modeling and Analysis of Dynamic Systems by Dr. Guillaume Ducard c Fall 2016 Institute for Dynamic Systems and Control ETH Zurich, Switzerland G. Ducard c 1 Outline 1 Lecture 9: Model Parametrization 2

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Lecture Notes 9: Constrained Optimization

Lecture Notes 9: Constrained Optimization Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information