
Smooth James-Stein model selection against erratic Stein unbiased risk estimate to select several regularization parameters

SYLVAIN SARDY
Department of Mathematics, University of Geneva
2-4 rue du Lièvre, CP 64, 1211 Genève 4, Switzerland; sylvain.sardy@unige.ch

Smooth James-Stein thresholding-based estimators enjoy smoothness like ridge regression and perform variable selection like lasso. They have added flexibility thanks to more than one regularization parameter (like adaptive lasso), and the ability to select these parameters well thanks to an unbiased and smooth estimation of the risk. The motivation is a gravitational wave burst detection problem from several concomitant high-frequency time series. A wavelet-based estimator is developed to combine information from all captors by block-thresholding multiresolution coefficients. We derive a universal threshold, an information criterion and an oracle inequality for this estimator. Smooth James-Stein thresholding is then employed in parametric linear regression, for which we derive a formula for unbiased risk estimation.

We perform a Monte-Carlo simulation to quantify the improvement and investigate the oracle property of smooth adaptive lasso empirically. Introducing the smoothness parameter allows smooth versions of existing estimators, for which we derive their equivalent degrees of freedom. Conversely, letting the smoothness parameter tend to one, we derive the equivalent degrees of freedom of lasso, adaptive lasso and group lasso. So far, only that of lasso is known.

Keywords: adaptive lasso, information criterion, James-Stein estimator, multivariate time series, sparse model selection, universal threshold, wavelet smoothing

1 Introduction

The James-Stein estimator (James and Stein, 1961) led to the idea that the MLE could be regularized to achieve better risk. Many methods of regularization (e.g., ridge regression (Hoerl and Kennard, 1970), Waveshrink (Donoho and Johnstone, 1994), the nonnegative garrote (Breiman, 1995), lasso (Tibshirani, 1996)) are driven by this idea and control the bias-variance trade-off with a single regularization parameter chosen to minimize an estimate of the risk. For instance, cross validation (Stone, 1974) is easy to implement to estimate the risk for any chosen loss function, but is known for its instability and its high computational cost. Generalized cross validation (Golub et al., 1979), information criteria (Akaike, 1973; Schwarz, 1978) or the Stein unbiased risk estimate (Stein, 1981) can alleviate those drawbacks. More recently, estimators with several regularization parameters (e.g., bridge (Fu, 1998), SCAD (Fan and Li, 2001a), EBayesThresh (Johnstone and Silverman, 2004, 2005), fused lasso (Tibshirani et al., 2005), elastic net (Zou and Hastie, 2005), adaptive lasso (Zou, 2006) and $\ell_\nu$-regularization (Sardy, 2009)) have been developed. It is often believed that, although these estimators are more flexible, their risk may be worse because selection of more than one regularization parameter is unstable. One goal of this paper is to prove the contrary thanks to smooth James-Stein thresholding.

We consider two situations where we employ smooth James-Stein thresholding. Section 2 presents the original motivation for our work, gravitational wave burst detection, in a nonparametric setting that we analyze in Section 4 after reviewing important results in modern model selection in Section 3. Section 5 considers the parametric regression setting with measured covariates. These two settings show that smooth James-Stein thresholding helps select the hyperparameters

better than with standard (non-smooth) thresholding. Moreover it offers a continuum between estimators by linking and extending the James-Stein estimator, soft-, hard- and block-thresholding (Donoho and Johnstone, 1994; Cai, 1999) in wavelet smoothing, and group lasso (Yuan and Lin, 2006) and adaptive lasso in variable selection.

2 Motivation

Gravitational wave bursts are expected to be produced by energetic cosmic phenomena such as the collapse of a supernova (Klimenko and Mitselmakher, 2004). They are believed to be rare, highly oscillating and of small intensity. Because the signal-to-noise ratio is believed to be low, only the concomitant recording by $Q$ captors (typically $Q = 3$) of the cosmic phenomena may help prove the existence of such wave bursts. Note that the captors are located far apart from each other to avoid recording local earth phenomena (such as an earthquake) on all $Q$ captors. Another difficulty is the non-white nature of the noise, which is possibly non-Gaussian. Finally the measurements are recorded at a high frequency of 5 kHz; one minute of recording has $3Q \times 10^5$ noisy measurements. A good model for these data is
$$S_t^{(q)} = \mu^{(q)}(t) + \epsilon_t^{(q)}, \quad t = 1, \ldots, T, \quad q = 1, \ldots, Q, \qquad (1)$$
where the noises $\epsilon^{(q)}$ and $\epsilon^{(q')}$ are independent between captors $q \neq q'$, but where the noise $\epsilon_t^{(q)}$ is temporally correlated for a given captor $q$. Importantly, most of the time we have on the one hand that the underlying signal $\mu^{(q)}(t) = 0$ for all $q$, but on the other hand that if $\mu^{(q)}(t) \neq 0$ for a given time $t$ and captor $q$, then $\mu^{(q')}(t) \neq 0$ for all other captors $q'$ at the same time $t$. Moreover when a wave burst occurs, we

may not have $\mu^{(q)}(t) = \mu^{(q')}(t)$, but only a proportionality constant relating them, because the incoming wave burst may not hit the captors at the same angle.

We assume each underlying signal $\mu^{(q)}$ expands on $T = N$ orthonormal approximation $\phi$ and fine scale $\psi$ wavelets:
$$\mu^{(q)}(t) = \sum_{k=0}^{2^{j_0}-1} \beta_{0,k}^{(q)} \phi_{j_0,k}(t) + \sum_{j=j_0}^{J} \sum_{n=0}^{N_j-1} \beta_{j,n}^{(q)} \psi_{j,n}(t), \qquad (2)$$
where the wavelets are obtained by dilation $j$ and translation $n$, $J = \log_2(N)$ and $N_j = 2^j$; see Donoho and Johnstone (1994). One can extract an orthonormal regression matrix $W = [\Phi_0\ \Psi_{j_0}\ \ldots\ \Psi_J]$ from this representation such that (1) becomes $S^{(q)} = W\beta^{(q)} + \varepsilon^{(q)}$. Applying the orthonormal wavelet decomposition $W^T$, the model can also be written as
$$\tilde S^{(q)} = \beta^{(q)} + \tilde\varepsilon^{(q)}, \qquad (3)$$
where $\tilde S^{(q)} = W^T S^{(q)}$ and $\tilde\varepsilon^{(q)} = W^T \varepsilon^{(q)}$. This latter model is interesting in three respects. First, Johnstone and Silverman (1997) show that stationary correlated noise $\varepsilon^{(q)}$ is well decorrelated by a wavelet transform within and between levels. For a given level $j$, the process has its own wavelet variance $\sigma_j^{2,(q)}$ (Percival, 1995; Serroukh et al., 2000) that depends on captor $q$ and that can be estimated from the data (Donoho and Johnstone, 1995). Figure 1 shows (left) the estimated spectrum from 26 seconds of recording, which reveals a strongly colored noise process, and (right) boxplots of 31 estimated wavelet variances on 8 levels from disjoint segments of signals of length $N = 4096$ from one captor. Second, the data are Gaussianized by the linear transformation $W^T$. The last interesting property is that, owing to the sparse wavelet representation of many signals, the vectors $\beta^{(q)}$ are sparse. Moreover, letting $\beta^{(q)} = (\beta_{j_0}^{(q)}, \ldots, \beta_J^{(q)})$, the sparsity (percentage of null entries)

varies from level $j$ to level $j'$, so the estimation should adapt to sparsity levelwise by selecting hyperparameters at each level. Consequently model (3) can be written at a given level $j \in \{j_0, \ldots, J\}$ as
$$Y_j^{(q)} = \alpha_j^{(q)} + \varepsilon_j^{(q)} \quad\text{with}\quad \alpha_j^{(q)} = \beta_j^{(q)}/\sigma_j^{(q)}, \quad Y_j^{(q)} = \tilde S_j^{(q)}/\sigma_j^{(q)}, \qquad (4)$$
where $\alpha_j^{(q)} = (\alpha_{j,1}^{(q)}, \ldots, \alpha_{j,N_j}^{(q)})$ is a sparse vector and $\varepsilon_j^{(q)}$ is i.i.d. N(0, 1).

Figure 1: EDA from 26 seconds of recording: (left) estimated spectrum and (right) boxplots of estimated wavelet variances on a log-scale from segments of one second for levels 4 to 11.

The important fact to recall is that, given wavelet level $j \in \{j_0, \ldots, J\}$ and translation

$n \in \{1, \ldots, N_j\}$, the vector $\alpha_{j,n} = (\alpha_{j,n}^{(1)}, \ldots, \alpha_{j,n}^{(Q)})$ has either all entries null, $\alpha_{j,n} = 0$ (due to no wave burst, or to no component of the wave burst on wavelet $n$ at level $j$), or, when $\alpha_{j,n} \neq 0$, entries that are typically all different (due to captors facing the arriving wave burst at different angles, or to different wavelet variances $\sigma_j^{2,(q)}$ and $\sigma_j^{2,(q')}$ for captors $q \neq q'$). Our goal is to derive an estimator that reflects this block sparsity and heterogeneity level-by-level.

3 Threshold, penalty and model selection

James and Stein (1961) showed the surprising result that, for estimating a vector $\alpha \in \mathbb{R}^P$ with $P > 2$ from a Gaussian measurement $Y \sim \mathrm{N}(\alpha, I)$, the maximum likelihood estimate (MLE) is not admissible with respect to the $\ell_2$-risk $R(\hat\alpha, \alpha) = E\|\hat\alpha - \alpha\|_2^2$. Indeed they showed that
$$\hat\alpha^{\mathrm{JS}}(Y) = \left(1 - \frac{P-2}{\|Y\|_2^2}\right) Y$$
verifies $R(\hat\alpha^{\mathrm{JS}}, \alpha) < R(\hat\alpha^{\mathrm{MLE}}, \alpha)$ for all $\alpha \in \mathbb{R}^P$, $P > 2$. The James-Stein estimator above does not shrink (because the factor $1 - \frac{P-2}{\|Y\|_2^2}$ can be less than $-1$) and does not threshold (i.e., $P(\hat\alpha^{\mathrm{JS}}(Y) = 0) = 0$). The drawback that the factor can be negative has been alleviated by taking its positive part,
$$\hat\alpha^{\mathrm{JS+}}(Y) = \left(1 - \frac{P-2}{\|Y\|_2^2}\right)_+ Y, \qquad (5)$$
which makes it not only a shrinkage estimator but also a thresholding estimator. Note that this latter estimator, although better than its previous version in terms of $\ell_2$-risk, remains inadmissible.
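For concreteness, the positive-part rule (5) is a one-liner; here is a minimal numpy sketch (the function name and toy check are illustrative, not part of the paper).

```python
import numpy as np

def james_stein_plus(y):
    """Positive-part James-Stein estimate (5) of alpha from Y ~ N(alpha, I_P), P > 2."""
    P = y.size
    factor = 1.0 - (P - 2) / np.sum(y ** 2)
    return max(factor, 0.0) * y   # taking the positive part makes it shrink and threshold

# toy check: a small-norm observation is set exactly to zero
rng = np.random.default_rng(0)
y = 0.1 * rng.standard_normal(10)
print(james_stein_plus(y))        # all zeros whenever ||y||_2^2 < P - 2
```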

Interestingly, the James-Stein estimator has the following oracle inequality:
$$R(\hat\alpha^{\mathrm{JS+}}, \alpha) \le 2 + \inf_{c \in \mathbb{R}} R(\hat\alpha^c, \alpha), \qquad (6)$$
with respect to the linear estimator $\hat\alpha^c = c\,\hat\alpha^{\mathrm{MLE}}$, for which the oracle factor is $c = \|\alpha\|_2^2/(P + \|\alpha\|_2^2)$ and the optimal $\ell_2$-risk is $R^*(c, \alpha) = \inf_c R(\hat\alpha^c, \alpha) = P\|\alpha\|_2^2/(P + \|\alpha\|_2^2)$. Candès (2005) provides an interesting review on oracle inequalities.

This result led to the theme of regularization in the form of shrinkage, penalized likelihood, thresholding, or posterior distributions. Hoerl and Kennard (1970) with ridge regression, Good and Gaskins (1971) with penalties, and Wahba's smoothing splines (see Wahba (1990) for an account) are examples of early key contributions. More recent contributions are Donoho and Johnstone (1994) with Waveshrink and Tibshirani (1996) with lasso. Both estimators assume the model $Y \sim \mathrm{N}(X\alpha, I)$, where $X$ is an $N \times P$ matrix of wavelets and of covariates, respectively. They aim at estimating $\alpha$ while enforcing sparsity: some entries of the estimated coefficients $\hat\alpha$ are set to zero, for sparse wavelet representation and model selection purposes, respectively. On the one hand, Waveshrink (which uses an orthonormal matrix $X$) regularizes the wavelet coefficients $\hat\alpha^{\mathrm{MLE}} = X^T Y$ by thresholding them componentwise towards zero with the nonlinear hard or soft thresholding functions
$$\eta^{\mathrm{hard}}_\varphi(x) = x\, 1_{\{|x| \ge \varphi\}}, \qquad (7)$$
$$\eta^{\mathrm{soft}}_\varphi(x) = \mathrm{sign}(x)(|x| - \varphi)_+, \qquad (8)$$
where $\varphi$ is the threshold: the estimate $\hat\alpha_{n,\varphi} = \eta_\varphi(\hat\alpha_n^{\mathrm{MLE}})$ is zero if $|\hat\alpha_n^{\mathrm{MLE}}| \le \varphi$, for $n = 1, \ldots, N$.
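The two componentwise rules (7) and (8) are easy to state in code; a small sketch (function names are ours):

```python
import numpy as np

def eta_hard(x, phi):
    """Hard thresholding (7): keep x where |x| >= phi, set it to zero otherwise."""
    return x * (np.abs(x) >= phi)

def eta_soft(x, phi):
    """Soft thresholding (8): shrink |x| by phi and kill what falls below zero."""
    return np.sign(x) * np.maximum(np.abs(x) - phi, 0.0)

x = np.array([-3.0, -0.5, 0.2, 1.8])
print(eta_hard(x, 1.0))   # [-3.  -0.   0.   1.8]
print(eta_soft(x, 1.0))   # [-2.  -0.   0.   0.8]
```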

Interestingly, when choosing the universal threshold $\varphi_N = \sqrt{2\log N}$, these estimators have the following oracle inequality:
$$R(\hat\alpha^{\mathrm{hard/soft}}_{\varphi_N}, \alpha) \le (2\log N + 1)\left(1 + \inf_{\delta \in \{0,1\}^N} R(\hat\alpha^\delta, \alpha)\right), \qquad (9)$$
with respect to the diagonal projector $\hat\alpha^\delta = \mathrm{diag}(\delta)\,\hat\alpha^{\mathrm{MLE}}$, for which the oracle binary factors are $\delta_n = 1_{\{|\alpha_n| \ge 1\}}$ and the optimal $\ell_2$-risk is $R^*(\delta, \alpha) = \inf_{\delta \in \{0,1\}^N} R(\hat\alpha^\delta, \alpha) = \sum_{n=1}^N \min(\alpha_n^2, 1)$. Comparing (9) with (6), the diagonal projector has a bound of order $\log N$ increasing with $N$, as opposed to the constant bound 2 of the linear estimator, but the oracle risk of (9) uses $N$ oracle parameters as opposed to one for (6) and has therefore more degrees of freedom. The universal threshold has asymptotic minimax properties (Donoho et al., 1995).

On the other hand, lasso regularizes the MLE by adding an $\ell_1$ penalty to the likelihood part and by estimating $\alpha$ solving
$$\min_\alpha \ \frac12 \|Y - X\alpha\|_2^2 + \lambda \|\alpha\|_1, \qquad (10)$$
for a given penalty parameter $\lambda \ge 0$. Interestingly, when $X = I$ and $\lambda = \varphi$, the soft shrinkage function applied componentwise to $Y$ is the closed form solution to (10) (Donoho et al., 1992); for other thresholding penalties see Antoniadis and Fan (2001). Lasso is not an oracle procedure (Fan and Li, 2001b), so, for a given $\nu > 1$, Zou (2006) proposed the adaptive lasso by solving
$$\min_\alpha \ \frac12 \|Y - X\alpha\|_2^2 + \lambda \sum_{p=1}^P \frac{1}{|\hat\alpha_p|^{\nu-1}}\, |\alpha_p|, \qquad (11)$$
which is oracle provided $\lambda$ is properly chosen and the $\hat\alpha_p$ are root-$n$-consistent estimates for $p = 1, \ldots, P$, e.g., the least squares estimates.

Waveshrink and lasso have been extended to achieve thresholding blockwise, as opposed to componentwise. For instance, Cai (1999) proposed the James-Stein-like estimator
$$\hat\alpha_{jb}^{\mathrm{Block}} = \left(1 - \frac{\lambda L}{\|\hat\alpha_{jb}^{\mathrm{MLE}}\|_2^2}\right)_+ \hat\alpha_{jb}^{\mathrm{MLE}} \qquad (12)$$
to threshold block $b$, made of $L = \log N$ raw wavelet coefficients $\hat\alpha_{jb}^{\mathrm{MLE}}$, at each resolution level $j$. He showed excellent asymptotic properties and an oracle inequality similar to (6), but with a constant factor (instead of one) in front of the oracle risk. Likewise Bakin (1999) and Yuan and Lin (2006) proposed group lasso to threshold blocks of coefficients by arbitrarily splitting the vector $\alpha$ into $J$ blocks $(\alpha_1, \ldots, \alpha_J)$ and solving
$$\min_\alpha \ \frac12 \|Y - X\alpha\|_2^2 + \lambda \sum_{j=1}^J \|\alpha_j\|_2, \qquad (13)$$
where $\|\alpha_j\|_2 = \sqrt{\sum_{i=1}^{p_j} \alpha_{ji}^2}$ and $p_j$ is the length of the vector $\alpha_j$, $j = 1, \ldots, J$. Hence when $J = P$, all blocks have size one and the estimator becomes the standard lasso (10).

This paper proposes to unify and extend these thresholding estimators with two goals in mind: offer a continuum of estimators for more flexibility, and select the indexes of the continuum in an efficient way for better estimation, first in the canonical regression setting, then in standard parametric regression. The paper is organized as follows. In Section 4, for the canonical regression problem with concomitant measurements, we define the generalized and smooth James-Stein estimators that operate blockwise and select two hyperparameters. Section 4.2 is concerned with the selection of the two hyperparameters, for which we propose two rules: (1) the Stein unbiased risk estimate rule of Section 4.2.1 shows that the smoother the estimator with respect to the data, the less erratic, yet still unbiased,

the estimation of its risk; (2) the universal threshold and information criterion rule of Section 4.2.2 is based on the asymptotic property that a zero sequence should be correctly estimated with a probability tending to one for a threshold increasing at a slow rate. Section 4.3 comes back to our gravitational wave burst detection problem to illustrate the methodology. For blocks of a constant size, Section 4.4 derives an oracle inequality for block thresholding with the universal threshold. Section 4.5 investigates the finite sample performance of the proposed estimator in a Monte-Carlo experiment. Finally, motivated by the excellent performance of smooth James-Stein thresholding, Section 5 considers the classical linear regression setting for the selection of covariates. We implicitly define the smooth James-Stein estimator as a fixed point and derive its equivalent degrees of freedom as a function of the two hyperparameters. This gives the equivalent degrees of freedom of the adaptive and group lasso as a particular case. We illustrate the methodology with a Monte Carlo simulation and the prostate cancer data.

4 Generalized and smooth James-Stein thresholding

4.1 Definition

We consider model (4), which we rewrite componentwise and, for simplicity, without the wavelet level $j$ as
$$Y_n^{(q)} = \alpha_n^{(q)} + \epsilon_n^{(q)}, \quad n = 1, \ldots, N, \quad q = 1, \ldots, Q, \qquad (14)$$

where the Gaussian noise is i.i.d. N(0, 1) (independent between captors $q \neq q'$ and within sequences). To fix notation, the vectors $Y_n = (Y_n^{(1)}, \ldots, Y_n^{(Q)})$ measure $\alpha_n = (\alpha_n^{(1)}, \ldots, \alpha_n^{(Q)})$, which are believed to be the zero vector most of the time.

To threshold vectors to zero, soft-Waveshrink can be generalized in the spirit of group lasso to the following block-soft-Waveshrink estimator that solves
$$\min_{\alpha_n \in \mathbb{R}^Q,\, n=1,\ldots,N} \ \frac12 \sum_{n=1}^N \|Y_n - \alpha_n\|_2^2 + \lambda \sum_{n=1}^N \|\alpha_n\|_2. \qquad (15)$$
This penalized least squares optimization can be separated into $N$ subproblems of dimension $Q$. In turn one can show that the solution to each $Q$-variate subproblem is a vector collinear with $Y_n$, but thresholded towards zero. Indeed the solution to (15) is
$$\hat\alpha_{n,\lambda} = \left(1 - \frac{\lambda}{\|Y_n\|_2}\right)_+ Y_n, \quad n = 1, \ldots, N, \qquad (16)$$
which is reminiscent of the James-Stein estimator (5), and different from Cai's estimator (12) in that $Q$ is fixed and does not increase with $N$. Interestingly, the 2-norm in (16) is not squared as in (5). To bridge between the two estimators and extend them, we propose first the generalized James-Stein (GJS) thresholding estimator
$$\hat\alpha_{n,\lambda,\nu} = \eta^{\mathrm{GJS}}_{\lambda,\nu}(Y_n) := \left(1 - \frac{\lambda^\nu}{\|Y_n\|_2^\nu}\right)_+ Y_n, \quad \lambda \ge 0,\ \nu > 0,\ Y_n \in \mathbb{R}^Q, \qquad (17)$$
which encompasses (16) for $\nu = 1$ and (5) for $\nu = 2$. Expressed as penalized least squares, the GJS estimator is the solution to
$$\min_{\alpha_n} \ \frac12 \|Y_n - \alpha_n\|_2^2 + \frac{\lambda^\nu}{\|Y_n\|_2^{\nu-1}}\, \|\alpha_n\|_2, \qquad (18)$$

which can also be seen as a group version of the adaptive lasso. With this parametrization, $\lambda$ is always the threshold value regardless of $\nu$. Note that $\nu = 1$ corresponds to soft thresholding (8), $\nu \to \infty$ corresponds to hard thresholding (7), and $0 < \nu < 1$ gives a thresholding function that biases the estimate even more than soft thresholding. Since GJS thresholding is almost differentiable, we consider the Stein unbiased risk estimate (Stein, 1981) to select both hyperparameters in the next section.

The univariate thresholding function ($Q = 1$) of the generalized James-Stein estimator is plotted in the top row of Figure 2 for $\nu = 10$ and $\lambda = 3$. The choice $\nu = 10$ makes the thresholding function close to hard thresholding. A property that proves important in Section 4.2.1 is that the thresholding function, although almost differentiable, is forced to have a large right derivative $\eta'_{\lambda,\nu}(\lambda^+) = \nu$ at the threshold $\lambda$ for large $\nu$, since the thresholding function converges to hard thresholding when $\nu \to \infty$. This motivates the smooth James-Stein (SJS) estimator
$$\hat\alpha_{n,\lambda,\nu;\gamma} = \eta^{\mathrm{SJS}}_{\lambda,\nu;\gamma}(Y_n) := \left(1 - \frac{\lambda^\nu}{\|Y_n\|_2^\nu}\right)_+^\gamma Y_n, \quad \lambda \ge 0,\ \nu > 0,\ \gamma \ge 1,\ Y_n \in \mathbb{R}^Q, \qquad (19)$$
which uses a thresholding function with smooth higher derivatives, for reasons we now explain.
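A minimal sketch of the block thresholding rule (19), which recovers (17) at $\gamma = 1$ and (16) at $\nu = 1$, $\gamma = 1$; the function name is ours and the numerical check is only illustrative.

```python
import numpy as np

def eta_sjs(y, lam, nu, gamma):
    """Smooth James-Stein thresholding (19) of a block y in R^Q.
    gamma = 1 gives the generalized James-Stein estimator (17);
    nu = 1, gamma = 1 gives block soft thresholding (16)."""
    norm = np.linalg.norm(y)
    if norm <= lam:
        return np.zeros_like(y)
    w = 1.0 - (lam / norm) ** nu        # lies in (0, 1) above the threshold
    return (w ** gamma) * y

y = np.array([2.0, -1.0, 0.5])
print(eta_sjs(y, lam=3.0, nu=10, gamma=1))                        # zero: ||y||_2 < 3
print(eta_sjs(np.array([4.0, 1.0, 0.0]), 3.0, 10, 2 * np.log(10) + 1))
```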

4.2 Hyperparameters selection

The added flexibility translates into better estimation of $\alpha$ provided both $\lambda$ and $\nu$ can be well selected from the data. To that aim we propose two selection rules. The first one minimizes an unbiased estimate of the risk and the second one minimizes an information criterion. The second one is preferable in a situation of strong sparsity, and the first one when the underlying vector is more dense.

4.2.1 Stein unbiased risk estimate

For the multivariate canonical model (14), we first consider the generalized James-Stein estimator (17). Its thresholding function is almost differentiable in Stein's sense (Stein, 1981), and one finds that the Stein unbiased risk estimate is
$$\mathrm{SURE}(\lambda, \nu) = \sum_{n=1}^N \|Y_n\|_2^2\, 1(\|Y_n\|_2 \le \lambda) + \sum_{n=1}^N \frac{\lambda^{2\nu}}{\|Y_n\|_2^{2\nu-2}}\, 1(\|Y_n\|_2 > \lambda) + 2 \sum_{n=1}^N \left(Q - (Q - \nu)\frac{\lambda^\nu}{\|Y_n\|_2^\nu}\right) 1(\|Y_n\|_2 > \lambda) - NQ. \qquad (20)$$
This bivariate function can be minimized over $\lambda$ and $\nu$ to select both hyperparameters. Section 4.5 investigates the finite sample performance of this selection with a Monte-Carlo simulation. It turns out not to be as good as the extension of SURE for the non-almost-differentiable thresholding obtained by $\ell_\nu$ penalized likelihood (Sardy, 2009). This drawback of SURE applied to the generalized James-Stein estimator can be alleviated in the following way.

The SURE function (20) is discontinuous in $\lambda$ at all $\|Y_n\|_2$, $n = 1, \ldots, N$. Consequently the unbiased risk estimate is erratic in $\lambda$ (i.e., has high variance) and is hard to minimize. This is particularly true when $\nu$ is large compared to $Q$, because the jumps at the discontinuity points are then large, as we can see in the second line of (20). This is due to the fact that generalized James-Stein thresholding is discontinuous in its first derivative, and that the larger $\nu$, the steeper the slope at the threshold value to approximate the hard thresholding function. To overcome this drawback, we propose the smooth James-Stein estimator (19), which not only thresholds but also has smoother higher derivatives for $\gamma > 1$, and

corresponds to (17) for $\gamma = 1$. The corresponding Stein unbiased risk estimate,
$$\mathrm{SURE}(\lambda, \nu; \gamma) = \sum_{n=1}^N \|Y_n\|_2^2\, 1(\|Y_n\|_2 \le \lambda) + 2Q \sum_{n=1}^N \left\{1 - \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu\right\}_+^\gamma + \sum_{n=1}^N \left(\left\{1 - \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu\right\}_+^\gamma - 1\right)^2 \|Y_n\|_2^2\, 1(\|Y_n\|_2 > \lambda) + 2\gamma\nu \sum_{n=1}^N \left\{1 - \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu\right\}_+^{\gamma-1} \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu 1(\|Y_n\|_2 > \lambda) - NQ, \qquad (21)$$
is consequently less erratic. Figure 2 illustrates the advantage of using a large $\gamma$ when estimating the risk with SURE for a large $\nu$. In practice, $\gamma$ could be estimated from the data by minimizing a three-dimensional SURE function. We adopt a more pragmatic approach: we set $\gamma = 2\log\nu + 1$, so that the thresholding function departs from the hard thresholding function and increases its smoothness only slightly as $\nu$ gets larger (thanks to the slowly increasing logarithmic function of $\nu$), and so that for $\nu = 1$ we fall back on the soft thresholding function (Donoho and Johnstone, 1995), since $\gamma = 1$ in that case. Section 4.5 shows that this choice leads to better risk estimation with (21) and $\gamma = 2\log\nu + 1$ than with (20) (i.e., $\gamma = 1$). This property implies better selection of the two hyperparameters, and hence provides the smooth James-Stein estimator with better $\ell_2$ risk, as the Monte-Carlo simulation shows in Section 4.5.

Model (14) and the SURE formulas (20) and (21) assume unit variance for the noise. In practice, one can estimate the variance and rescale the responses to have approximate unit variance. To avoid estimating the variance, one can observe that the Stein formulas (20) and (21) can be written in the form $\mathrm{RSS} + 2\sigma^2\,\mathrm{d.f.} + c$, where d.f. stands for degrees of freedom. Like generalized cross validation (Golub et al., 1979), one can define a generalized SURE as
$$\mathrm{GSURE}(\lambda, \nu; \gamma) = \frac{\mathrm{RSS}(\lambda, \nu; \gamma)/N}{(1 - \mathrm{d.f.}(\lambda, \nu; \gamma)/N)^2},$$
which is a first order approximation to the exact SURE formula and which estimates the variance by $\hat\sigma^2 = \mathrm{RSS}(\lambda, \nu; \gamma)/N$. In practice SURE can be minimized on a fine grid thanks to the rapid calculation of the estimate (19) for any given pair $(\lambda, \nu)$.

Figure 2: Thresholding functions and corresponding Stein unbiased risk estimation for $Q = 1$ and $\nu = 10$ on simulated data. Left: Subbotin $\ell_\nu$ penalized least squares ($\nu = 0.4$); middle: generalized James-Stein ($\gamma = 1$); right: smooth James-Stein with smoothness parameter $\gamma = 2\log\nu + 1$. Top: thresholding functions with parameters chosen to approximate hard thresholding ($\lambda = 3$); the left one is discontinuous at the threshold but with a small slope at the threshold, the middle one has a sharp change of the left and right derivatives at the threshold, and the right one has a smooth change of derivative at the threshold. Bottom: corresponding Stein unbiased risk estimate and true loss. We observe on the same data that the right estimation is less erratic than the middle one.
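A sketch of how (21) can be evaluated and minimized on a grid, assuming unit-variance blocks stored as the rows of an $N \times Q$ array; names, the grid and the simulated data are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def sure_sjs(Y, lam, nu, gamma):
    """Stein unbiased risk estimate (21) for smooth James-Stein thresholding of
    N blocks of dimension Q with unit-variance Gaussian noise.
    Y has shape (N, Q); gamma = 1 recovers (20)."""
    N, Q = Y.shape
    norms = np.linalg.norm(Y, axis=1)
    small = norms <= lam
    w = np.where(small, 0.0, 1.0 - (lam / norms) ** nu)   # positive part of 1 - (lam/||Y_n||)^nu
    rss = np.sum(np.where(small, norms ** 2, (1.0 - w ** gamma) ** 2 * norms ** 2))
    div = np.sum(Q * w ** gamma
                 + gamma * nu * np.where(small, 0.0, w ** (gamma - 1) * (lam / norms) ** nu))
    return rss + 2.0 * div - N * Q

# select (lam, nu) on a grid with gamma = 2*log(nu) + 1, as suggested in the text
rng = np.random.default_rng(1)
Y = rng.standard_normal((1000, 3)); Y[:20] += 5.0
grid = [(lam, nu) for lam in np.linspace(1.0, 6.0, 26) for nu in (1, 2, 4, 10)]
lam_hat, nu_hat = min(grid, key=lambda p: sure_sjs(Y, p[0], p[1], 2 * np.log(p[1]) + 1))
print(lam_hat, nu_hat)
```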

4.2.2 Universal threshold and information criterion

We now derive a universal threshold and an information criterion for the generalized James-Stein estimator. The idea is to approximate the distribution of the smallest threshold $\lambda_Y$ that, for a sample $Y = (Y_1, \ldots, Y_N)$ of size $N$, sets to zero all $N$ blocks of length $Q$ when the true underlying model is made of zero vectors. Controlling the maximum of $\lambda_Y$ then leads to a finite sample universal threshold $\bar\lambda_{N,Q}$, an asymptotic universal threshold $\lambda_{N,Q}$, and a prior $\pi_\lambda$, which are used to define an information criterion in the spirit of the sparsity $\ell_\nu$ information criterion SL$_\nu$IC (Sardy, 2009), as we now show.

With the assumption that $Y_n \sim \mathrm{N}_Q(0, I_Q)$ for $n = 1, \ldots, N$, we seek the smallest threshold $\lambda_{N,Q}$ such that the generalized James-Stein estimator (17) estimates the right model with a probability tending to one:
$$P(\hat\alpha_{1,\lambda,\nu} = 0, \ldots, \hat\alpha_{N,\lambda,\nu} = 0) = P(\|Y_1\|_2^2 \le \lambda_{N,Q}^2, \ldots, \|Y_N\|_2^2 \le \lambda_{N,Q}^2) \rightarrow_N 1. \qquad (22)$$
The distribution of $M_N = \max_{n=1,\ldots,N} \|Y_n\|_2^2$, where the $\|Y_n\|_2^2$ are i.i.d. $\chi^2_Q = \Gamma(Q/2, 1/2)$, is degenerate. Extreme value theory provides a proper rescaling of $M_N$, namely $c_N^{-1}(M_N - d_N(Q)) \rightarrow_d G_0$, where $G_0(x) = \exp(-\exp(-x))$ is the Gumbel distribution, $c_N = 2$, and $d_N(Q)$ is the root in $\xi$ of
$$\log N - \log\Gamma(Q/2) = (1 - Q/2)\log(\xi/2) + \xi/2. \qquad (23)$$
The normalizing constant $2(\log N + (Q/2 - 1)\log\log N - \log\Gamma(Q/2))$ given by Embrechts et al. (1997, p. 156) for the Gamma distribution is an asymptotic approximation to the root of (23), and leads to a good Gumbel approximation

when $N$ is large compared to $\Gamma(Q/2)$. In that case we define the asymptotic universal threshold
$$\lambda_{N,Q} = \sqrt{2\left(\log N + (Q/2)\log\log N - \log\Gamma(Q/2)\right)}, \qquad (24)$$
for which the asymptotic property (22) is satisfied since
$$P(\|Y_1\|_2^2 \le \lambda_{N,Q}^2, \ldots, \|Y_N\|_2^2 \le \lambda_{N,Q}^2) \approx G_0(\log\log N) = \exp(-1/\log N) \ge 1 - 1/\log N \rightarrow_N 1.$$
Note that for $Q = 1$ we get back the standard universal threshold $\sqrt{2\log N}$ up to a small term, and for $Q = 2$ we get back the universal threshold of Sardy (2000) for denoising complex-valued wavelet coefficients. When $Q$ gets large, however, the proposed normalizing constant is too far from the exact root to provide a useful approximation, so we find the root $d_N(Q)$ of (23) numerically, which exists provided $N$ is large enough (e.g., $\log N > \log\Gamma(Q/2) + (Q/2 - 1)(1 - \log(Q/2 - 1))$). The finite sample universal threshold is then defined as
$$\bar\lambda_{N,Q} = \sqrt{d_N(Q) + c_N \log\log N} \qquad (25)$$
$$\text{with } d_N(Q) \text{ the root of (23)}, \qquad (26)$$
so as to have the same rate of convergence for all $Q$ as with $\lambda_{N,Q}$ in place of $\bar\lambda_{N,Q}$ in (25).

Property 1. The inequality $\bar\lambda^2_{N,Q} \ge \lambda^2_{N,Q}$ holds for all $Q \ge 2$ and all $N \ge N_0(Q) = \exp(\Gamma(Q/2)^{1/(Q/2-1)})$, and $\bar\lambda^2_{N,Q} \sim \lambda^2_{N,Q}$ as $N \to \infty$.

Proof: Studying (23) as a function of $\xi$, one sees that the root for a finite sample is larger than the asymptotic root given by Embrechts et al. (1997, p. 156) provided $(Q/2 - 1)\log\log N - \log\Gamma(Q/2) \ge 0$.
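A sketch of how the finite sample universal threshold (25)-(26) can be computed by solving (23) numerically; the bracketing interval passed to the root finder is our choice and assumes $N$ is large enough for the root to exist.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln

def universal_threshold(N, Q):
    """Finite sample universal threshold (25)-(26): solve (23) numerically for d_N(Q)
    and add the Gumbel rescaling term c_N * log(log N) with c_N = 2."""
    def f(xi):  # root equation (23), written as f(xi) = 0
        return (1 - Q / 2) * np.log(xi / 2) + xi / 2 - (np.log(N) - gammaln(Q / 2))
    # bracket the larger root: start just above the minimizer xi = 2(Q/2 - 1)
    root = brentq(f, 2 * max(Q / 2 - 1, 1e-8) + 1e-6, 10 * (np.log(N) + Q))
    return np.sqrt(root + 2 * np.log(np.log(N)))

print(universal_threshold(N=4096, Q=3))
print(np.sqrt(2 * np.log(4096)))     # Q = 1 reference: standard universal threshold
```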

More than a bound, the asymptotic Gumbel pivot for $M_N$ leads to a prior distribution $F_\lambda(\lambda) = G_0((\lambda^2 - d_N(Q))/2)$ for the threshold $\lambda$ that reconstructs true zero vectors from noisy measurements. Bayes theorem provides the joint posterior distribution of the coefficients and the hyperparameters. Taking its negative logarithm leads to the following information criterion, in the spirit of the sparsity $\ell_\nu$ information criterion (Sardy, 2009).

Definition. Suppose model (14), or model (3.1) of Cai (1999) for block thresholding in wavelet smoothing, holds. The sparsity weighted $\ell_2$ information criterion for estimation of $(\alpha_1, \ldots, \alpha_N)$ and selection of $(\lambda, \nu)$ of the generalized James-Stein estimator (18) is
$$\mathrm{SL}^{\mathrm{w}}_2\mathrm{IC}(\alpha_1, \ldots, \alpha_N, \lambda, \nu) = \frac12 \sum_{n=1}^N \|Y_n - \alpha_n\|_2^2 + \sum_{n=1}^N \frac{\lambda^\nu}{\|Y_n\|_2^{\nu-1}}\, \|\alpha_n\|_2 - N\log\left(\frac{\Gamma(Q/2)}{2\pi^{Q/2}\Gamma(Q)}\right) - QN\nu\log\lambda + Q(\nu - 1)\sum_{n=1}^N \log\|Y_n\|_2 - \log\pi_\lambda(\lambda; \tau_{N,Q}) - \log\pi_\nu(\nu), \qquad (27)$$
where $\pi_\nu$ is a prior for $\nu$ (that we choose Uniform, possibly improper, on the set $\Omega_\nu \subset (0, \infty)$ where thresholding occurs), $\pi_\lambda(\lambda; \tau) = F'_\lambda(\lambda; \tau)$ with $F_\lambda(\lambda; \tau) = G_0((\lambda^2/\tau^2 - d_N(Q))/2)$, and $\tau$ is calibrated to $\tau^2_{N,Q} = \lambda^2_{N,Q}/(QN\nu + 1)$ to match asymptotic model consistency when the true sequences are null, i.e., when $\alpha_n = 0$, $n = 1, \ldots, N$. In practice, one minimizes SL$^{\mathrm{w}}_2$IC like AIC or BIC to select the hyperparameters $(\lambda, \nu)$ and estimate the sequences $\alpha_n$, $n = 1, \ldots, N$. Deriving an information criterion for the smooth James-Stein estimator (i.e., $\gamma > 1$) requires knowing the equivalent optimization problem it solves, which is an open problem.
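Under the form of (27) given above, the criterion can be evaluated directly; the sketch below is ours, assumes a uniform prior for $\nu$ (a constant, hence dropped) and the Gumbel prior for $\lambda$ stated in the Definition, and uses hypothetical argument names.

```python
import numpy as np
from scipy.special import gammaln

def sl2w_ic(alpha, Y, lam, nu, dNQ, tau2):
    """Evaluate the SL^w_2 IC (27), up to the constant -log pi_nu of a uniform prior on nu.
    alpha and Y have shape (N, Q); dNQ is the root of (23) and tau2 the calibrated tau^2."""
    N, Q = Y.shape
    norms = np.linalg.norm(Y, axis=1)
    fit = 0.5 * np.sum((Y - alpha) ** 2)
    pen = np.sum(lam ** nu / norms ** (nu - 1) * np.linalg.norm(alpha, axis=1))
    const = -N * (gammaln(Q / 2) - np.log(2.0) - (Q / 2) * np.log(np.pi) - gammaln(Q))
    scale = -Q * N * nu * np.log(lam) + Q * (nu - 1) * np.sum(np.log(norms))
    u = (lam ** 2 / tau2 - dNQ) / 2.0
    log_prior_lam = -np.exp(-u) - u + np.log(lam / tau2)   # log of d/dlam G0((lam^2/tau^2 - dNQ)/2)
    return fit + pen + const + scale - log_prior_lam
```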

4.3 Application to wave burst estimation and detection

We illustrate how smooth James-Stein thresholding with SURE can be employed blockwise and levelwise for wavelet smoothing of $Q = 3$ concomitant time series recorded to detect gravitational wave bursts, as described in Section 2. To that aim we consider three real series of length $T = 2^{14}$ (about 3.27 seconds of recording). Taking $j_0 = 4$ in the wavelet expansion (2), we select 22 hyperparameters with SURE (11 levels with two hyperparameters each). We then add a so-called injection at time $t = 1500$, which represents an artificial wave burst. The injection is three times larger on the second captor than on the first, and five times smaller on the third captor, as plotted in dotted lines in Figure 4. Figure 3 shows the data (first line) recorded on each captor (columnwise), the estimate by blockwise smooth James-Stein smoothing (second line), and the estimate obtained by employing a univariate smoothing per captor (third line). We observe that, as opposed to univariate smoothing, blockwise smoothing detects the injection and almost only the injection (except right before time $t = $ ), whereas the other has many false detections. Zooming in around the injection, Figure 4 shows that the estimation of the injection is also better with blockwise smoothing. Section 4.5 compares several estimators in a Monte-Carlo simulation.

4.4 Oracle inequality

The $\ell_2$ risk for model (14) is
$$R(\hat\alpha, \alpha) = \sum_{n=1}^N \rho_n(\alpha_n) = \sum_{n=1}^N E\|\hat\alpha_n - \alpha_n\|_2^2.$$

Figure 3: A wave burst injection at time $t = 1500$ added to three real captor noise series (columnwise). Top: time series of length $T = 2^{14}$. Middle: estimation with the smooth James-Stein estimator, levelwise and blockwise, with hyperparameters selected by SURE. Bottom: estimation by applying a univariate smoothing per captor, levelwise, with hyperparameters selected by SURE.

Following Donoho and Johnstone (1994), the oracle predictive performance of the block diagonal projection estimator $\hat\alpha_n = \delta_n Y_n$, where $\delta_n \in \{0, 1\}$, is
$$\rho_n(\delta_n, \alpha_n) = \begin{cases} \|\alpha_n\|_2^2, & \text{if } \delta_n = 0, \\ Q, & \text{if } \delta_n = 1. \end{cases}$$

Figure 4: Zoom on the injection, represented by a dotted line. The estimates are plotted with a continuous line: blockwise smoothing (first line) and univariate smoothing per captor (second line).

Hence the oracle hyperparameters are $\delta_n = 1_{\{\|\alpha_n\|_2^2 > Q\}}$ for $n = 1, \ldots, N$, and the corresponding oracle overall risk for this block-thresholding estimator is
$$R^*(\delta, \alpha) = \sum_{n=1}^N \min(\|\alpha_n\|_2^2, Q).$$
The following theorem extends the oracle inequality obtained by Donoho and Johnstone (1994) for $Q = 1$ to block thresholding for $Q \ge 2$.

Theorem 1: Consider any fixed $Q \ge 2$. Then there exists a sample size $N_0(Q)$ such that, for all $N \ge N_0(Q)$ and with the universal threshold $\bar\lambda_{N,Q}$ defined

in (26), the smooth James-Stein estimator (19) with $\nu = 1$ and $\gamma = 1$ achieves the oracle inequality
$$R(\hat\alpha^{\mathrm{SJS}}_{\bar\lambda_{N,Q},1;1}, \alpha) \le (Q + \lambda^2_{N,Q})(Q + R^*(\delta, \alpha)),$$
where $\lambda^2_{N,Q} = 2\log N + Q\log\log N - 2\log\Gamma(Q/2)$.

Proof: See Appendix A.

This result shows that we can mimic the overall oracle risk achieved with $N$ oracle hyperparameters within a factor of essentially $2\log N + Q\log\log N - 2\log\Gamma(Q/2)$ with a block thresholding estimator driven by a single hyperparameter. The smallest sample size $N_0(Q)$ for which the inequality holds is quite small in practice; more work is needed to get a tight expression.

4.5 Monte-Carlo simulation

We reproduce the Monte-Carlo simulation of Johnstone and Silverman (2004) for $Q = 1$ captor to estimate a sparse sequence of length $N = 1000$ and of varying degrees of sparsity, as measured by the number of nonzero terms, taken in $\{5, 50, 500\}$, and by the value $\mu$ of the nonzero terms, taken in $\{3, 4, 5, 7\}$. Table 1 reports estimated risks of four estimators: smooth James-Stein, generalized James-Stein, Subbotin $\ell_\nu$ penalized likelihood (Sardy, 2009) and EBayesThresh (Johnstone and Silverman, 2004). The results clearly show the superiority of the smooth James-Stein estimator with SURE. Indeed the smooth estimator is always at least as good as the generalized James-Stein estimate, which proves excellent selection of the hyperparameters thanks to the smoothness of the estimate and of the SURE function when $\gamma > 1$. In the case of extreme sparsity, only generalized James-Stein using the information criterion and EBayesThresh perform better. Note that this drawback

of SURE has been explained by Donoho and Johnstone (1995, Section 2.4).

Table 1: Monte-Carlo simulation for a sequence of length $N = 1000$: average total squared loss of the smooth James-Stein (SJS) estimator with hyperparameters selected by SURE, the generalized James-Stein (GJS) estimator with hyperparameters selected by SURE and by SL$^{\mathrm{w}}_2$IC, the Subbotin$(\lambda, \nu)$ posterior mode estimator with hyperparameters selected by SURE and by SL$_\nu$IC, and the EBayesThresh estimator with Cauchy-like prior. Columns correspond to the number of nonzero entries (5, 50 or 500) and their value $\mu \in \{3, 4, 5, 7\}$; the first block of rows is for $Q = 1$ captor (SJS, GJS, Subbotin, EBayesThresh) and the second block for $Q = 3$ captors (SJS, GJS).

We consider a similar Monte-Carlo simulation with now $Q = 3$ concomitantly observed sequences: all three underlying sequences are identical in the location of the nonzero entries, but not in their amplitude. Since three sequences carry more information than a single one, we may hope to distinguish noise from signal with a smaller signal-to-noise ratio. Hence we consider that the values of the nonzero terms are fixed to $\mu_1 = 1$, $\mu_2 = 2$ and $\mu_3 = \mu$ taken in $\{3, 4, 5, 7\}$. The estimated risk of the smooth James-Stein estimator is reported in Table 1. To allow some comparison with $Q = 1$, the estimated overall risk of the $Q \times N$ sequences is divided by $Q$. The smooth James-Stein estimator again performs best overall, while the SL$^{\mathrm{w}}_2$IC selection rule performs better than for a single observed sequence, and the improvement from $Q = 1$ to $Q = 3$ is larger the more sparse the sequence.

5 Smooth adaptive lasso

We now consider the classical parametric regression setting
$$Y = \mu + \epsilon \quad\text{with}\quad \mu = \alpha_0 \mathbf{1} + X\alpha \quad\text{and}\quad \epsilon \ \text{i.i.d.}\ \mathrm{N}(0, 1), \qquad (28)$$
where $X$ is any $N \times P$ regression matrix with a single response vector (i.e., $Q = 1$). Our goal here is to employ the smooth James-Stein thresholding function, leading to a smooth version of the adaptive lasso. We use ideas from optimization to implicitly define the smooth James-Stein estimate as a fixed point, and we derive an unbiased estimate of its risk using the implicit function theorem to obtain the Jacobian of the estimate with respect to the data. The additional smoothness of the estimator leads to a less erratic unbiased risk estimate that is easier to minimize as a function of its hyperparameters, as we observed in the previous section.

5.1 Definition

When doing variable selection, we believe only a subset of the $P$ covariates forms the true linear association, or equivalently that some entries $\alpha_p$ of the coefficient vector $\alpha$ in (28) are zero. To estimate a sparse vector, one can solve the penalized least squares problem
$$\min_{\beta \in \mathbb{R}^P} \ \frac12 \|Y - X_\Sigma\beta\|_2^2 + \lambda^\nu \sum_{p=1}^P \frac{|\beta_p|}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}} \qquad (29)$$
for a given regularization parameter $\lambda$. Note that in (29) the matrix $X_\Sigma$ means that the covariates have been mean-centered (to avoid penalizing the intercept) and rescaled, i.e., $X_\Sigma = (I - \frac1N J)XD_\Sigma$, where $J$ is the $N \times N$ matrix of ones and $D_\Sigma$ is a diagonal matrix that properly rescales the covariates; see Sardy (2008)

for a precise definition of the $\Sigma$-rescaling. The estimator $\hat\beta_\lambda$ solving (29) is called lasso (Tibshirani, 1996) when $\nu = 1$ and adaptive lasso (Zou, 2006) when $\nu > 1$. The coefficient estimates of the original problem are then $\hat\alpha_\lambda = D_\Sigma\hat\beta_\lambda$ and $\hat\alpha_0 = \bar y - \frac1N \mathbf{1}^T X\hat\alpha_\lambda$. The adaptive lasso enjoys an oracle property in the sense that it can guess the true underlying model with a probability tending to one and behaves essentially like the least squares estimate on the correct model. The adaptive lasso is efficiently calculated thanks to the lars algorithm (Efron et al., 2004). In practice a stable selection of $\nu$ remains an open problem, to which we provide a solution below.

We saw in the previous section (i.e., for $X = I$) both that the parameter $\nu$ can be selected by minimizing the Stein unbiased risk estimate and that its smooth James-Stein extension enjoys better predictive performance. Our goal here is to generalize both ideas to the regression setting (28).

Relaxation, also called backfitting, is another possible algorithm to solve (29) (see Fu (1998); Sardy et al. (2000)). It allows one to implicitly define the estimate as a fixed point of the algorithm in the following way. Starting with any initial guess for $\beta$, the relaxation algorithm keeps all entries constant except the $p$th entry, which is updated to
$$\beta_p^{\mathrm{new}} = \frac{1}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, \|x_p\|_2^2}\, \eta^{\mathrm{soft}}_{\lambda^\nu}\!\left(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p\right),$$
the solution to a convex univariate optimization subproblem, where $r_p = Y - X_{-p}\beta^{\mathrm{old}}$ are the residuals without the $p$th term, and $X_{-p}$ is the rescaled matrix $X_\Sigma$ with its $p$th column $x_p$ replaced by a column of zeros. At the limit, a fixed point defines the solution, which solves the system of nonlinear equations
$$\hat\beta_p(Y) = \frac{1}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, \|x_p\|_2^2}\, \eta^{\mathrm{soft}}_{\lambda^\nu}\!\left(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p\right)$$

with $r_p = Y - X_{-p}\hat\beta(Y)$, $p = 1, \ldots, P$. Instead of employing the almost differentiable soft thresholding, we iteratively employ the differentiable smooth James-Stein thresholding according to
$$\beta_p^{\mathrm{new}} = \frac{r_p^T x_p}{\|x_p\|_2^2} \left[\frac{\eta^{\mathrm{soft}}_{\lambda^\nu}(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p)}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p}\right]^\gamma \quad\text{with } r_p = Y - X_{-p}\beta^{\mathrm{old}}.$$
The fixed point
$$\hat\beta_p(Y) = \frac{r_p^T x_p}{\|x_p\|_2^2} \left[\frac{\eta^{\mathrm{soft}}_{\lambda^\nu}(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p)}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p}\right]^\gamma \quad\text{with } r_p = Y - X_{-p}\hat\beta(Y), \quad p = 1, \ldots, P, \qquad (30)$$
implicitly defines the smooth James-Stein estimate, called the smooth adaptive lasso estimate. One checks that when $X = I$ (in which case $\hat\beta_p^{\mathrm{LS}} = y_p$ for $p = 1, \ldots, N$), then (30) is the smooth James-Stein thresholding previously defined in (19). The cases $\gamma = 0$ and $\gamma = 1$ are known to converge to a fixed point, since they correspond to solving least squares and adaptive lasso, respectively. We have always observed convergence to a unique estimate empirically for a whole range of values of $\gamma$, so we conjecture convergence to a unique fixed point for all $\gamma \ge 0$. The assumption that the columns of $X$ are linearly independent is sufficient to guarantee a unique (adaptive) lasso estimate, but can be relaxed under certain conditions (Sardy, 2009, Theorem 3).
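A sketch of the backfitting iteration behind (30), assuming the matrix passed in is already the centered and rescaled $X_\Sigma$ and that a fixed number of sweeps is enough for convergence (which the text only conjectures for general $\gamma$); names are ours.

```python
import numpy as np

def eta_soft(x, phi):
    return np.sign(x) * np.maximum(np.abs(x) - phi, 0.0)

def smooth_adaptive_lasso(X, Y, lam, nu, gamma, n_iter=200):
    """Backfitting sketch of the fixed point (30); gamma = 1 recovers the adaptive lasso
    update. X plays the role of the centered/rescaled matrix X_Sigma."""
    N, P = X.shape
    beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]
    c = np.abs(beta_ls) ** (nu - 1)               # adaptive weights |beta_p^LS|^(nu-1)
    beta = np.zeros(P)
    for _ in range(n_iter):
        for p in range(P):
            r_p = Y - X @ beta + X[:, p] * beta[p]    # residuals without the p-th term
            s = r_p @ X[:, p]
            t = c[p] * s
            if t == 0.0:
                beta[p] = 0.0
                continue
            beta[p] = (s / (X[:, p] @ X[:, p])) * (eta_soft(t, lam ** nu) / t) ** gamma
    return beta
```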

5.2 Unbiased risk estimate

Provided the estimator $\hat\mu = g(Y) + Y$, with $g(Y) = \bar y\mathbf{1} + X_\Sigma\hat\beta(Y) - Y$ and $\hat\beta(Y)$ defined in (30), is almost differentiable with respect to the data, we can employ the Stein unbiased risk estimate to select $(\lambda, \nu)$. This requires the partial derivatives
$$\partial g_n(Y)/\partial Y_n = x_n^T \nabla_n\hat\beta(Y) - 1 + 1/N, \quad n = 1, \ldots, N, \qquad (31)$$
where $\nabla_n\hat\beta(Y)$ is the derivative of $\hat\beta(Y)$ with respect to $Y_n$ and $x_1, \ldots, x_N$ are the rows of $X_\Sigma$. The following theorem shows how this can be done.

Theorem 2: Let $\hat\beta^{\mathrm{LS}} = AY$ where $A = (X_\Sigma^T X_\Sigma)^- X_\Sigma^T = (a_{pn})$ (using the generalized inverse if necessary). For a given $(\lambda > 0, \nu \ge 1)$, the smooth adaptive lasso estimate $\hat\beta(Y)$ in (30) is continuously differentiable for $\gamma > 1$. Moreover let $I_0 = \{p \in \{1, \ldots, P\}: \hat\beta_p(Y) = 0\} = \{p_i\}_{i=1,\ldots,|I_0|}$ of cardinality $|I_0|$, let $\bar I_0$ be its complement, and let $X_\Sigma^{\bar I_0}$ be the columns of $X_\Sigma$ with an index in $\bar I_0$. For a given $n$, the non-zero elements $h_n^{\bar I_0}$ of $\nabla_n\hat\beta(Y)$ are the solution to the following system of $|\bar I_0|$ linear equations with as many unknowns:
$$x_p^T X_\Sigma^{\bar I_0} D_p^{\bar I_0} h_n^{\bar I_0} = z_{np}, \quad p = p_i, \quad i = 1, \ldots, |\bar I_0|, \qquad (32)$$
where
$$z_{np} = x_{np} + u_p v_p, \qquad (33)$$
and $D_p^{\bar I_0}$ is the identity matrix except that its $i$th diagonal element is
$$D_{p,ii}^{\bar I_0} = \frac{w_p^\gamma}{1 - \gamma + \gamma w_p} =: v_p^{-1}, \qquad (34)$$
with
$$w_p = \frac{\eta^{\mathrm{soft}}_{\lambda^\nu}(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p)}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p} \quad\text{and}\quad u_p = \frac{\gamma(\nu - 1)\, a_{pn}\, (w_p - 1)\, r_p^T x_p}{\hat\beta_p^{\mathrm{LS}}\, w_p^\gamma}. \qquad (35)$$

Solving (32) leads to the gradient $\nabla_n\hat\beta(Y)$ needed in (31) to calculate $\mathrm{SURE}(\lambda, \nu; \gamma)$ for all $\lambda$, $\nu$ and $\gamma$.

Proof: See Appendix B.

Note that we used the least squares estimate as the raw estimate of the coefficients, but the formula would still hold for other linear estimators. In the limit when $\gamma$ tends to one, the adaptive lasso has $D_{p,ii}^{\bar I_0} = 1$ for all $i$ in (34). A particular case of interest is the lasso (i.e., $\nu = 1$), for which $z_{np} = x_{np}$ in (33), so that (32) leads to $h_n^{\bar I_0} = ((X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0})^{-1}(x_n^{\bar I_0})^T$. Hence, from (31), we see that
$$\sum_{n=1}^N \partial g_n(Y)/\partial Y_n = -N + \sum_{n=1}^N x_n^T \nabla_n\hat\beta(Y) = -N + \sum_{n=1}^N x_n^{\bar I_0} ((X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0})^{-1} (x_n^{\bar I_0})^T = -N + \mathrm{trace}\left(((X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0})^{-1} (X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0}\right) = -N + |\bar I_0|,$$
where $|\bar I_0|$ is lasso's degrees of freedom previously found by Zou et al. (2007). In practice we first minimize SURE over $(\lambda, \nu)$ for $\gamma = 1$ (i.e., adaptive lasso), which we do efficiently on a fine grid thanks to lars (Efron et al., 2004). This provides a neighborhood for $\lambda$, namely $[\lambda/10, 10\lambda]$, over which we then search on a grid for the minimizer of SURE for the smooth adaptive lasso ($\gamma > 1$). This strategy provides a good and efficient selection of the pair of hyperparameters.
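As a sanity check on the degrees of freedom entering SURE, the divergence can also be approximated by finite differences instead of solving the linear systems (32); the wrapper `fit_mu` below is a hypothetical helper returning the fitted values for fixed hyperparameters.

```python
import numpy as np

def divergence_fd(fit_mu, Y, eps=1e-5):
    """Finite-difference estimate of sum_n d mu_hat_n / d Y_n, i.e. the equivalent
    degrees of freedom entering SURE; fit_mu(Y) is any fitted-values map, e.g. a
    wrapper around the smooth adaptive lasso fit for fixed (lam, nu, gamma)."""
    mu0 = fit_mu(Y)
    div = 0.0
    for n in range(Y.size):
        Yp = Y.copy()
        Yp[n] += eps
        div += (fit_mu(Yp)[n] - mu0[n]) / eps
    return div
```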

5.3 Monte-Carlo simulation

We replicate the Monte-Carlo simulation of Zou (2006) with $P = 8$ covariates whose coefficients are either sparse ($\alpha = (3, 1.5, 0, 0, 2, 0, 0, 0)$ for Model 1, with $N \in \{20, 60\}$) or not sparse ($\alpha = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)$ for Model 2, with $N \in \{40, 80\}$). The covariates $x_n$ ($n = 1, \ldots, N$) are i.i.d. Gaussian vectors with pairwise correlation between $x_{n,i}$ and $x_{n,j}$ given by $\mathrm{cor}(i, j) = 0.5^{|i-j|}$. The noise is Gaussian with standard error $\sigma \in \{1, 3, 6\}$. Like Zou (2006), we consider the relative prediction error
$$\mathrm{RPE} = E[(x_{\mathrm{test}}^T\{\hat\alpha([X, Y]_{\mathrm{training}}) - \alpha\})^2]/\sigma^2,$$
where the expectation is taken over training and test sets, and the response. Note that we calculate RPE given $(X, Y)_{\mathrm{training}}$ exactly, using the knowledge of the distribution of the covariates (Zou relies on 10,000 test observations instead). This predictive measure is reported in Table 2 to compare lasso, adaptive lasso and smooth adaptive lasso using two-fold cross-validation to select their hyperparameter(s). We observe that smooth adaptive lasso improves significantly over adaptive lasso, and over lasso when the underlying model is sparse and the noise is small; the estimation is slightly worse for the non-sparse model.

We also consider selection of the hyperparameters with SURE to compare adaptive lasso and its smooth version. To compare the two estimators, we consider the relative prediction error conditional on the covariates of the training set,
$$\mathrm{RPE}_X = \sum_{n=1}^N E[(x_{\mathrm{training},n}^T\{\hat\alpha([X, Y]_{\mathrm{training}}) - \alpha\})^2]/\sigma^2,$$
where the expectation is taken over the response only. This measure, although less useful in practice unless prediction is sought at the same locations as the training set, helps quantify the improvement of using smooth James-Stein thresholding. Table 3 reports the results, which show a systematic gain of the smooth version over adaptive lasso. Lasso is often better for the non-sparse model. We also report in Table 4 the number C of selected nonzero components and the number I of zero components incorrectly selected. We observe that the selection is correct when the noise is small ($\sigma = 1$) with smooth/adaptive lasso based on SURE, but that some false detections occur when the noise grows ($\sigma = 3$).

Table 2: Zou's Monte-Carlo simulation: median RPE of lasso, adaptive lasso and smooth adaptive lasso, using two-fold cross-validation to select the hyperparameter(s), on 100 training sets, for Model 1 ($N = 20$ and $N = 60$) and Model 2 ($N = 40$ and $N = 80$) with $\sigma \in \{1, 3, 6\}$.

Table 3: Zou's Monte-Carlo simulation: median $\mathrm{RPE}_X$ at the training covariates $X$ of lasso, adaptive lasso and smooth adaptive lasso, using SURE to select the hyperparameter(s), on 100 training sets, for Model 1 ($N = 20$ and $N = 60$) and Model 2 ($N = 40$ and $N = 80$) with $\sigma \in \{1, 3, 6\}$.

Table 4: Median number of selected variables for Model 1 with $n = 60$: number C of correctly selected nonzero components and number I of incorrectly selected zero components, for $\sigma = 1$ and $\sigma = 3$, for the truth, lasso, adaptive lasso, SCAD and garrote (results taken from the Monte-Carlo simulation of Zou (2006)), and for adaptive lasso and smooth adaptive lasso with hyperparameters selected by SURE.

5.4 Smooth SURE for the prostate cancer data

We illustrate the estimation of the $\ell_2$ risk and the smoothness gained with the smooth adaptive lasso on the prostate cancer data with $P = 8$ covariates used by Tibshirani (1996). Figure 5 shows the estimated risk as a function of the two hyperparameters $\lambda$ and $\nu$, either for the adaptive lasso (left) or for the smooth adaptive lasso (right). Both estimations of the risk are unbiased, but the latter is less erratic. The advantage of smoothness would be even more pronounced with data containing a larger collection of covariates. The optimal hyperparameters $(\hat\lambda, \hat\nu) = (1.12, 10)$ select a subset of five covariates out of the eight original ones.

Figure 5: Prostate cancer data: Stein unbiased risk estimate as a function of $\lambda$ and $\nu$ for the adaptive lasso (left, with $\gamma = 1$) and the smooth adaptive lasso (right, with $\gamma = \log\nu$).

6 Further extensions

Smooth James-Stein thresholding (19) offers additional flexibility to fit the underlying sparsity thanks to the parameters $(\lambda, \nu)$, for which we proposed selection rules. Interestingly, this thresholding function relies on the $\ell_2$ norm. This measure may not be appropriate in certain situations. Indeed, if one wants to measure departure from the zero vector in the sense that all entries must be different from zero,

then $\|Y_n\|_2$ in (19) should possibly be replaced to give
$$\hat\alpha_{n,\lambda,\nu;\gamma} = \left(1 - \frac{\lambda^\nu}{\min_{q=1,\ldots,Q} |Y_{nq}|^\nu}\right)_+^\gamma Y_n, \quad \lambda \ge 0,\ \nu > 0,\ \gamma \ge 1,\ Y_n \in \mathbb{R}^Q, \qquad (36)$$
in which case the universal threshold is $\lambda_{N,Q} = \sqrt{(2/Q)\log N}$. Other quantiles could be considered, or other norms, like the $\ell_1$ norm, to penalize a large departure from zero in the entries of $Y_n$ less than the $\ell_2$ norm does.
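A small sketch of the variant (36) and of its universal threshold; names are ours.

```python
import numpy as np

def eta_min(y, lam, nu, gamma):
    """Variant (36): threshold on the smallest absolute entry of the block instead of
    its l2 norm, so a block survives only if all its entries are away from zero."""
    m = np.min(np.abs(y))
    if m <= lam:
        return np.zeros_like(y)
    return (1.0 - (lam / m) ** nu) ** gamma * y

N, Q = 1000, 3
lam_univ = np.sqrt(2.0 * np.log(N) / Q)   # universal threshold for the min-based rule
```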

For the group lasso, Yuan and Lin (2006) approximate the degrees of freedom of their estimator. One can instead follow the derivation of Theorem 2 to derive the exact degrees of freedom. Consider the smooth group adaptive lasso, a generalization of the group lasso (13), defined as
$$\hat\beta_j(Y) = s_j\, (1/w_j)^\gamma \quad\text{with}\quad w_j = \frac{\|\hat\beta_j^{\mathrm{LS}}\|_2^{\nu-1}\, \|s_j\|_2}{\eta^{\mathrm{soft}}_{\lambda^\nu}(\|\hat\beta_j^{\mathrm{LS}}\|_2^{\nu-1}\, \|s_j\|_2)}, \quad s_j = X_j^T r_j, \quad r_j = Y - X_{-j}\hat\beta(Y), \quad j = 1, \ldots, J, \qquad (37)$$
where the coefficient vector $\beta \in \mathbb{R}^P$ is segmented as $\beta = (\beta_1, \ldots, \beta_J)$ into $J$ blocks of respective sizes $p_j$ such that $\sum_{j=1}^J p_j = P$. Likewise $\hat\beta^{\mathrm{LS}} = (\hat\beta_1^{\mathrm{LS}}, \ldots, \hat\beta_J^{\mathrm{LS}})$ and $X = [X_1 \ldots X_J]$ with $X_j^T X_j = I_{p_j}$ for $j = 1, \ldots, J$. Note that (37) is the generalization of (2.4) of Yuan and Lin (2006), taking $K_j = I_{p_j}$ in their notation. Let $J_0 = \{j \in \{1, \ldots, J\}: \hat\beta_j(Y) = 0\} = \{j_i\}_{i=1,\ldots,|J_0|}$ of cardinality $|J_0|$, and let $\bar J_0$ be its complement. Deriving SURE in that setting requires calculating the nonzero elements $h_n^{\bar J_0}$ of the gradient $\nabla_n\hat\beta(Y)$ of the estimated vector defined by (37) with respect to the data $Y_n$, $n = 1, \ldots, N$. Following a derivation similar to that of Appendix B, one finds that the $h_n^{\bar J_0}$ are defined by a set of linear equations of the form (32), more precisely
$$C_j X_j^T X^{\bar J_0} D_j^{\bar J_0} h_n^{\bar J_0} = z_{nj}, \quad j = j_i, \quad i = 1, \ldots, |\bar J_0|, \qquad (38)$$
where the $z_{nj}$ are vectors of length $p_j$ and $D_j^{\bar J_0}$ is a block diagonal matrix made of identity matrices, except for the $j$th block, which is equal to $C_j^{-1}$, where $C_j = e_j s_j s_j^T + f_j I_{p_j}$ is a positive definite matrix since $e_j = \gamma (1/w_j)^{\gamma-1} \lambda^\nu / \|\hat\beta_j^{\mathrm{LS}}\|_2^{\nu-1} / \|s_j\|_2^3 > 0$ and $f_j = (1/w_j)^\gamma > 0$. Based on (38), SURE can be calculated and minimized over all values of $\gamma \ge 1$ and $\nu \ge 1$ to provide the equivalent degrees of freedom of the (smooth adaptive) group lasso. Unless blocks are of unit size, in which case there is no more grouping of covariates, there does not seem to be a closed form expression for the degrees of freedom of the group lasso when $\nu = 1$ and $\gamma \ge 1$.


The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Spatial Lasso with Applications to GIS Model Selection

Spatial Lasso with Applications to GIS Model Selection Spatial Lasso with Applications to GIS Model Selection Hsin-Cheng Huang Institute of Statistical Science, Academia Sinica Nan-Jung Hsu National Tsing-Hua University David Theobald Colorado State University

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Signal Denoising with Wavelets

Signal Denoising with Wavelets Signal Denoising with Wavelets Selin Aviyente Department of Electrical and Computer Engineering Michigan State University March 30, 2010 Introduction Assume an additive noise model: x[n] = f [n] + w[n]

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

A significance test for the lasso

A significance test for the lasso 1 Gold medal address, SSC 2013 Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Reaping the benefits of LARS: A special thanks to Brad Efron,

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

Relaxed Lasso. Nicolai Meinshausen December 14, 2006

Relaxed Lasso. Nicolai Meinshausen December 14, 2006 Relaxed Lasso Nicolai Meinshausen nicolai@stat.berkeley.edu December 14, 2006 Abstract The Lasso is an attractive regularisation method for high dimensional regression. It combines variable selection with

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010 Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to

More information

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University

More information

Machine learning, shrinkage estimation, and economic theory

Machine learning, shrinkage estimation, and economic theory Machine learning, shrinkage estimation, and economic theory Maximilian Kasy December 14, 2018 1 / 43 Introduction Recent years saw a boom of machine learning methods. Impressive advances in domains such

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Comments on \Wavelets in Statistics: A Review" by. A. Antoniadis. Jianqing Fan. University of North Carolina, Chapel Hill

Comments on \Wavelets in Statistics: A Review by. A. Antoniadis. Jianqing Fan. University of North Carolina, Chapel Hill Comments on \Wavelets in Statistics: A Review" by A. Antoniadis Jianqing Fan University of North Carolina, Chapel Hill and University of California, Los Angeles I would like to congratulate Professor Antoniadis

More information

Lecture 5: Soft-Thresholding and Lasso

Lecture 5: Soft-Thresholding and Lasso High Dimensional Data and Statistical Learning Lecture 5: Soft-Thresholding and Lasso Weixing Song Department of Statistics Kansas State University Weixing Song STAT 905 October 23, 2014 1/54 Outline Penalized

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

A robust hybrid of lasso and ridge regression

A robust hybrid of lasso and ridge regression A robust hybrid of lasso and ridge regression Art B. Owen Stanford University October 2006 Abstract Ridge regression and the lasso are regularized versions of least squares regression using L 2 and L 1

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Quantile universal threshold for model selection

Quantile universal threshold for model selection Quantile universal threshold for model selection Caroline Giacobino Sylvain Sardy Jairo Diaz Rodriguez Nick Hengartner In various settings such as regression, wavelet denoising and low rank matrix estimation,

More information

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania Professors Antoniadis and Fan are

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Pathwise coordinate optimization

Pathwise coordinate optimization Stanford University 1 Pathwise coordinate optimization Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani Stanford University Acknowledgements: Thanks to Stephen Boyd, Michael Saunders,

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach

Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1999 Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach T. Tony Cai University of Pennsylvania

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Sparsity in Underdetermined Systems

Sparsity in Underdetermined Systems Sparsity in Underdetermined Systems Department of Statistics Stanford University August 19, 2005 Classical Linear Regression Problem X n y p n 1 > Given predictors and response, y Xβ ε = + ε N( 0, σ 2

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Howard D. Bondell and Brian J. Reich Department of Statistics, North Carolina State University,

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function

Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-03-10 Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function

More information

arxiv: v2 [math.st] 9 Feb 2017

arxiv: v2 [math.st] 9 Feb 2017 Submitted to the Annals of Statistics PREDICTION ERROR AFTER MODEL SEARCH By Xiaoying Tian Harris, Department of Statistics, Stanford University arxiv:1610.06107v math.st 9 Feb 017 Estimation of the prediction

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Statistics for high-dimensional data: Group Lasso and additive models

Statistics for high-dimensional data: Group Lasso and additive models Statistics for high-dimensional data: Group Lasso and additive models Peter Bühlmann and Sara van de Geer Seminar für Statistik, ETH Zürich May 2012 The Group Lasso (Yuan & Lin, 2006) high-dimensional

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information