
Smooth James-Stein model selection against erratic Stein unbiased risk estimate to select several regularization parameters

SYLVAIN SARDY
Department of Mathematics, University of Geneva
2-4 rue du Lièvre, CP 64, 1211 Genève 4, Switzerland; sylvain.sardy@unige.ch

Smooth James-Stein thresholding-based estimators enjoy smoothness like ridge regression and perform variable selection like lasso. They have added flexibility thanks to more than one regularization parameter (like adaptive lasso), and the ability to select these parameters well thanks to an unbiased and smooth estimation of the risk. The motivation is a gravitational wave burst detection problem from several concomitant high-frequency time series. A wavelet-based estimator is developed to combine information from all captors by block-thresholding multiresolution coefficients. We derive a universal threshold, an information criterion and an oracle inequality for this estimator. Smooth James-Stein thresholding is then employed in parametric linear regression, for which we derive a formula for unbiased risk estimation.

We perform a Monte-Carlo simulation to quantify the improvement and investigate the oracle property of smooth adaptive lasso empirically. Introducing the smoothness parameter allows smooth versions of existing estimators, for which we derive their equivalent degrees of freedom. Conversely, letting the smoothness parameter tend to one, we derive the equivalent degrees of freedom of lasso, adaptive lasso and group lasso. So far, only that of lasso is known.

Keywords: adaptive lasso, information criterion, James-Stein estimator, multivariate time series, sparse model selection, universal threshold, wavelet smoothing

1 Introduction

The James-Stein estimator (James and Stein, 1961) led to the idea that the MLE could be regularized to achieve better risk. Many methods of regularization (e.g., ridge regression (Hoerl and Kennard, 1970), Waveshrink (Donoho and Johnstone, 1994), the nonnegative garrote (Breiman, 1995), lasso (Tibshirani, 1996)) are driven by this idea and control the bias-variance trade-off with a single regularization parameter chosen to minimize an estimate of the risk. For instance, cross validation (Stone, 1974) is easy to implement to estimate the risk for any chosen loss function, but is known for its instability and its high computational cost. Generalized cross validation (Golub et al., 1979), information criteria (Akaike, 1973; Schwarz, 1978) or the Stein unbiased risk estimate (Stein, 1981) can alleviate those drawbacks. More recently, estimators with several regularization parameters (e.g., bridge (Fu, 1998), SCAD (Fan and Li, 2001a), EBayesThresh (Johnstone and Silverman, 2004, 2005), fused lasso (Tibshirani et al., 2005), elastic net (Zou and Hastie, 2005), adaptive lasso (Zou, 2006) and $\ell_\nu$-regularization (Sardy, 2009)) have been developed. It is often believed that, although these estimators are more flexible, their risk may be worse because selection of more than one regularization parameter is unstable. One goal of this paper is to prove the contrary thanks to smooth James-Stein thresholding.

We consider two situations where we employ smooth James-Stein thresholding. Section 2 presents the original motivation for our work, gravitational wave burst detection, in a nonparametric setting that we analyze in Section 4 after reviewing important results in modern model selection in Section 3. Section 5 considers the parametric regression setting with measured covariates. These two settings show that smooth James-Stein thresholding helps select the hyperparameters

better than with standard (non-smooth) thresholding. Moreover it offers a continuum between estimators by linking and extending the James-Stein estimator, soft-, hard- and block-thresholding (Donoho and Johnstone, 1994; Cai, 1999) in wavelet smoothing, and group lasso (Yuan and Lin, 2006) and adaptive lasso in variable selection.

2 Motivation

Gravitational wave bursts are expected to be produced by energetic cosmic phenomena such as the collapse of a supernova (Klimenko and Mitselmakher, 2004). They are believed to be rare, highly oscillating and of small intensity. Because the signal-to-noise ratio is believed to be low, only the concomitant recording by $Q$ captors (typically $Q = 3$) of the cosmic phenomena may help prove the existence of such wave bursts. Note that the captors are located far apart from each other to avoid recording local earth phenomena (such as an earthquake) on all $Q$ captors. Another difficulty is the non-white nature of the noise, which is possibly non-Gaussian. Finally the measurements are recorded at a high frequency of 5 kHz; one minute of recording has $3Q \times 10^5$ noisy measurements. A good model for these data is
$$S_t^{(q)} = \mu^{(q)}(t) + \epsilon_t^{(q)}, \quad t = 1, \ldots, T, \quad q = 1, \ldots, Q, \qquad (1)$$
where the noises $\epsilon^{(q)}$ and $\epsilon^{(q')}$ are independent between captors $q \neq q'$, but where the noise $\epsilon_t^{(q)}$ is temporally correlated for a given captor $q$. Importantly, most of the time we have on the one hand that the underlying signal $\mu^{(q)}(t) = 0$ for all $q$, but on the other hand that if $\mu^{(q)}(t) \neq 0$ for a given time $t$ and captor $q$, then $\mu^{(q')}(t) \neq 0$ for all other captors $q'$ at the same time $t$. Moreover when a wave burst occurs, we

may not have $\mu^{(q)}(t) = \mu^{(q')}(t)$, but only a proportionality constant relating them, because the incoming wave burst may not hit the captors at the same angle.

We assume each underlying signal $\mu^{(q)}$ expands on $T = N$ orthonormal approximation $\phi$ and fine scale $\psi$ wavelets:
$$\mu^{(q)}(t) = \sum_{k=0}^{2^{j_0}-1} \beta_{0,k}^{(q)} \phi_{j_0,k}(t) + \sum_{j=j_0}^{J} \sum_{n=0}^{N_j-1} \beta_{j,n}^{(q)} \psi_{j,n}(t), \qquad (2)$$
where the wavelets are obtained by dilation $j$ and translation $n$, $J = \log_2(N)$ and $N_j = 2^j$; see Donoho and Johnstone (1994). One can extract an orthonormal regression matrix $W = [\Phi_0\ \Psi_{j_0}\ \ldots\ \Psi_J]$ from this representation such that (1) becomes $S^{(q)} = W\beta^{(q)} + \varepsilon^{(q)}$. Applying the orthonormal wavelet decomposition $W^T$, the model can also be written as
$$\tilde S^{(q)} = \beta^{(q)} + \tilde\varepsilon^{(q)}, \qquad (3)$$
where $\tilde S^{(q)} = W^T S^{(q)}$ and $\tilde\varepsilon^{(q)} = W^T \varepsilon^{(q)}$. This latter model is interesting in three respects. First, Johnstone and Silverman (1997) show that stationary correlated noise $\varepsilon^{(q)}$ is well decorrelated by a wavelet transform within and between levels. For a given level $j$, the process has its own wavelet variance $\sigma_j^{2,(q)}$ (Percival, 1995; Serroukh et al., 2000) that depends on captor $q$ and that can be estimated from the data (Donoho and Johnstone, 1995). Figure 1 shows (left) the estimated spectrum from 26 seconds of recording, which reveals a strongly colored noise process, and (right) boxplots of 31 estimated wavelet variances on 8 levels from disjoint segments of signals of length $N = 4096$ from one captor. Second, the data are Gaussianized by the linear transformation $W^T$. The last interesting property is that, owing to the sparse wavelet representation of many signals, the vectors $\beta^{(q)}$ are sparse. Moreover, letting $\beta^{(q)} = (\beta_{j_0}^{(q)}, \ldots, \beta_J^{(q)})$, the sparsity (percentage of null entries)

varies from level $j$ to level $j'$, so the estimation should adapt to sparsity levelwise by selecting hyperparameters at each level. Consequently model (3) can be written at a given level $j \in \{j_0, \ldots, J\}$ as
$$Y_j^{(q)} = \alpha_j^{(q)} + \varepsilon_j^{(q)} \quad\text{with}\quad \alpha_j^{(q)} = \beta_j^{(q)}/\sigma_j^{(q)}, \quad Y_j^{(q)} = \tilde S_j^{(q)}/\sigma_j^{(q)}, \qquad (4)$$
where $\alpha_j^{(q)} = (\alpha_{j,1}^{(q)}, \ldots, \alpha_{j,N_j}^{(q)})$ is a sparse vector and $\varepsilon_j^{(q)}$ is i.i.d. N(0, 1).

Figure 1: EDA from 26 seconds of recording: (left) estimated spectrum and (right) boxplots of estimated wavelet variances on a log-scale from segments of one second for levels 4 to 11.

The important fact to recall is that, given wavelet level $j \in \{j_0, \ldots, J\}$ and translation

$n \in \{1, \ldots, N_j\}$, the vector $\alpha_{j,n} = (\alpha_{j,n}^{(1)}, \ldots, \alpha_{j,n}^{(Q)})$ has either all entries null, $\alpha_{j,n} = 0$ (due to no wave burst, or to no component of the wave burst on wavelet $n$ at level $j$), or, when $\alpha_{j,n} \neq 0$, entries that are typically all different (due to captors facing the arriving wave burst at different angles, or to different wavelet variances $\sigma_j^{2,(q)}$ and $\sigma_j^{2,(q')}$ for captors $q \neq q'$). Our goal is to derive an estimator that reflects this block sparsity and heterogeneity level-by-level.

3 Threshold, penalty and model selection

James and Stein (1961) showed the surprising result that, for estimating a vector $\alpha \in \mathbb{R}^P$ with $P > 2$ from a Gaussian measurement $Y \sim \mathrm{N}(\alpha, I)$, the maximum likelihood estimate (MLE) is not admissible with respect to the $\ell_2$-risk $R(\hat\alpha, \alpha) = E\|\hat\alpha - \alpha\|_2^2$. Indeed they showed that
$$\hat\alpha^{\mathrm{JS}}(Y) = \left(1 - \frac{P-2}{\|Y\|_2^2}\right) Y$$
verifies $R(\hat\alpha^{\mathrm{JS}}, \alpha) < R(\hat\alpha^{\mathrm{MLE}}, \alpha)$ for all $\alpha \in \mathbb{R}^P$, $P > 2$. The James-Stein estimator above does not shrink (because the factor $1 - \frac{P-2}{\|Y\|_2^2}$ can be less than $-1$) and does not threshold (i.e., $P(\hat\alpha^{\mathrm{JS}}(Y) = 0) = 0$). The drawback that the factor can be negative has been alleviated by taking its positive part,
$$\hat\alpha^{\mathrm{JS+}}(Y) = \left(1 - \frac{P-2}{\|Y\|_2^2}\right)_+ Y, \qquad (5)$$
which makes it not only a shrinkage estimator but also a thresholding estimator. Note that this latter estimator, although better than its previous version in terms of $\ell_2$-risk, remains inadmissible.
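For concreteness, the positive-part rule (5) is a one-liner; here is a minimal numpy sketch (the function name and toy check are illustrative, not part of the paper).

```python
import numpy as np

def james_stein_plus(y):
    """Positive-part James-Stein estimate (5) of alpha from Y ~ N(alpha, I_P), P > 2."""
    P = y.size
    factor = 1.0 - (P - 2) / np.sum(y ** 2)
    return max(factor, 0.0) * y   # taking the positive part makes it shrink and threshold

# toy check: a small-norm observation is set exactly to zero
rng = np.random.default_rng(0)
y = 0.1 * rng.standard_normal(10)
print(james_stein_plus(y))        # all zeros whenever ||y||_2^2 < P - 2
```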

Interestingly, the James-Stein estimator has the following oracle inequality:
$$R(\hat\alpha^{\mathrm{JS+}}, \alpha) \le 2 + \inf_{c \in \mathbb{R}} R(\hat\alpha^c, \alpha), \qquad (6)$$
with respect to the linear estimator $\hat\alpha^c = c\,\hat\alpha^{\mathrm{MLE}}$, for which the oracle factor is $c = \|\alpha\|_2^2/(P + \|\alpha\|_2^2)$ and the optimal $\ell_2$-risk is $R^*(c, \alpha) = \inf_c R(\hat\alpha^c, \alpha) = P\|\alpha\|_2^2/(P + \|\alpha\|_2^2)$. Candès (2005) provides an interesting review on oracle inequalities.

This result led to the theme of regularization in the form of shrinkage, penalized likelihood, thresholding, or posterior distributions. Hoerl and Kennard (1970) with ridge regression, Good and Gaskins (1971) with penalties, and Wahba's smoothing splines (see Wahba (1990) for an account) are examples of early key contributions. More recent contributions are Donoho and Johnstone (1994) with Waveshrink and Tibshirani (1996) with lasso. Both estimators assume the model $Y \sim \mathrm{N}(X\alpha, I)$, where $X$ is an $N \times P$ matrix of wavelets and of covariates, respectively. They aim at estimating $\alpha$ while enforcing sparsity: some entries of the estimated coefficients $\hat\alpha$ are set to zero, for sparse wavelet representation and model selection purposes, respectively. On the one hand, Waveshrink (which uses an orthonormal matrix $X$) regularizes the wavelet coefficients $\hat\alpha^{\mathrm{MLE}} = X^T Y$ by thresholding them componentwise towards zero with the nonlinear hard or soft thresholding functions
$$\eta^{\mathrm{hard}}_\varphi(x) = x\, 1_{\{|x| \ge \varphi\}}, \qquad (7)$$
$$\eta^{\mathrm{soft}}_\varphi(x) = \mathrm{sign}(x)(|x| - \varphi)_+, \qquad (8)$$
where $\varphi$ is the threshold: the estimate $\hat\alpha_{n,\varphi} = \eta_\varphi(\hat\alpha_n^{\mathrm{MLE}})$ is zero if $|\hat\alpha_n^{\mathrm{MLE}}| \le \varphi$, for $n = 1, \ldots, N$.
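The two componentwise rules (7) and (8) are easy to state in code; a small sketch (function names are ours):

```python
import numpy as np

def eta_hard(x, phi):
    """Hard thresholding (7): keep x where |x| >= phi, set it to zero otherwise."""
    return x * (np.abs(x) >= phi)

def eta_soft(x, phi):
    """Soft thresholding (8): shrink |x| by phi and kill what falls below zero."""
    return np.sign(x) * np.maximum(np.abs(x) - phi, 0.0)

x = np.array([-3.0, -0.5, 0.2, 1.8])
print(eta_hard(x, 1.0))   # [-3.  -0.   0.   1.8]
print(eta_soft(x, 1.0))   # [-2.  -0.   0.   0.8]
```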

Interestingly, when choosing the universal threshold $\varphi_N = \sqrt{2\log N}$, these estimators have the following oracle inequality:
$$R(\hat\alpha^{\mathrm{hard/soft}}_{\varphi_N}, \alpha) \le (2\log N + 1)\left(1 + \inf_{\delta \in \{0,1\}^N} R(\hat\alpha^\delta, \alpha)\right), \qquad (9)$$
with respect to the diagonal projector $\hat\alpha^\delta = \mathrm{diag}(\delta)\,\hat\alpha^{\mathrm{MLE}}$, for which the oracle binary factors are $\delta_n = 1_{\{|\alpha_n| \ge 1\}}$ and the optimal $\ell_2$-risk is $R^*(\delta, \alpha) = \inf_{\delta \in \{0,1\}^N} R(\hat\alpha^\delta, \alpha) = \sum_{n=1}^N \min(\alpha_n^2, 1)$. Comparing (9) with (6), the diagonal projector has a bound of order $\log N$ increasing with $N$, as opposed to the constant bound 2 of the linear estimator, but the oracle risk of (9) uses $N$ oracle parameters as opposed to one for (6) and has therefore more degrees of freedom. The universal threshold has asymptotic minimax properties (Donoho et al., 1995).

On the other hand, lasso regularizes the MLE by adding an $\ell_1$ penalty to the likelihood part and by estimating $\alpha$ solving
$$\min_\alpha \ \frac12 \|Y - X\alpha\|_2^2 + \lambda \|\alpha\|_1, \qquad (10)$$
for a given penalty parameter $\lambda \ge 0$. Interestingly, when $X = I$ and $\lambda = \varphi$, the soft shrinkage function applied componentwise to $Y$ is the closed form solution to (10) (Donoho et al., 1992); for other thresholding penalties see Antoniadis and Fan (2001). Lasso is not an oracle procedure (Fan and Li, 2001b), so, for a given $\nu > 1$, Zou (2006) proposed the adaptive lasso by solving
$$\min_\alpha \ \frac12 \|Y - X\alpha\|_2^2 + \lambda \sum_{p=1}^P \frac{1}{|\hat\alpha_p|^{\nu-1}}\, |\alpha_p|, \qquad (11)$$
which is oracle provided $\lambda$ is properly chosen and the $\hat\alpha_p$ are root-$n$-consistent estimates for $p = 1, \ldots, P$, e.g., the least squares estimates.

Waveshrink and lasso have been extended to achieve thresholding blockwise, as opposed to componentwise. For instance, Cai (1999) proposed the James-Stein-like estimator
$$\hat\alpha_{jb}^{\mathrm{Block}} = \left(1 - \frac{\lambda L}{\|\hat\alpha_{jb}^{\mathrm{MLE}}\|_2^2}\right)_+ \hat\alpha_{jb}^{\mathrm{MLE}} \qquad (12)$$
to threshold block $b$, made of $L = \log N$ raw wavelet coefficients $\hat\alpha_{jb}^{\mathrm{MLE}}$, at each resolution level $j$. He showed excellent asymptotic properties and an oracle inequality similar to (6), but with a constant factor (instead of one) in front of the oracle risk. Likewise Bakin (1999) and Yuan and Lin (2006) proposed group lasso to threshold blocks of coefficients by arbitrarily splitting the vector $\alpha$ into $J$ blocks $(\alpha_1, \ldots, \alpha_J)$ and solving
$$\min_\alpha \ \frac12 \|Y - X\alpha\|_2^2 + \lambda \sum_{j=1}^J \|\alpha_j\|_2, \qquad (13)$$
where $\|\alpha_j\|_2 = \sqrt{\sum_{i=1}^{p_j} \alpha_{ji}^2}$ and $p_j$ is the length of the vector $\alpha_j$, $j = 1, \ldots, J$. Hence when $J = P$, all blocks have size one and the estimator becomes the standard lasso (10).

This paper proposes to unify and extend these thresholding estimators with two goals in mind: offer a continuum of estimators for more flexibility, and select the indexes of the continuum in an efficient way for better estimation, first in the canonical regression setting, then in standard parametric regression. The paper is organized as follows. In Section 4, for the canonical regression problem with concomitant measurements, we define the generalized and smooth James-Stein estimators that operate blockwise and select two hyperparameters. Section 4.2 is concerned with the selection of the two hyperparameters, for which we propose two rules: (1) the Stein unbiased risk estimate rule of Section 4.2.1 shows that the smoother the estimator with respect to the data, the less erratic, yet still unbiased,

the estimation of its risk; (2) the universal threshold and information criterion rule of Section 4.2.2 is based on the asymptotic property that a zero sequence should be correctly estimated with a probability tending to one for a threshold increasing at a slow rate. Section 4.3 comes back to our gravitational wave burst detection problem to illustrate the methodology. For blocks of a constant size, Section 4.4 derives an oracle inequality for block thresholding with the universal threshold. Section 4.5 investigates the finite sample performance of the proposed estimator in a Monte-Carlo experiment. Finally, motivated by the excellent performance of smooth James-Stein thresholding, Section 5 considers the classical linear regression setting for the selection of covariates. We implicitly define the smooth James-Stein estimator as a fixed point and derive its equivalent degrees of freedom as a function of the two hyperparameters. This gives the equivalent degrees of freedom of the adaptive and group lasso as a particular case. We illustrate the methodology with a Monte Carlo simulation and the prostate cancer data.

4 Generalized and smooth James-Stein thresholding

4.1 Definition

We consider model (4), which we rewrite componentwise and, for simplicity, without the wavelet level $j$ as
$$Y_n^{(q)} = \alpha_n^{(q)} + \epsilon_n^{(q)}, \quad n = 1, \ldots, N, \quad q = 1, \ldots, Q, \qquad (14)$$

where the Gaussian noise is i.i.d. N(0, 1) (independent between captors $q \neq q'$ and within sequences). To fix notation, the vectors $Y_n = (Y_n^{(1)}, \ldots, Y_n^{(Q)})$ measure $\alpha_n = (\alpha_n^{(1)}, \ldots, \alpha_n^{(Q)})$, which are believed to be the zero vector most of the time.

To threshold vectors to zero, soft-Waveshrink can be generalized in the spirit of group lasso to the following block-soft-Waveshrink estimator that solves
$$\min_{\alpha_n \in \mathbb{R}^Q,\, n=1,\ldots,N} \ \frac12 \sum_{n=1}^N \|Y_n - \alpha_n\|_2^2 + \lambda \sum_{n=1}^N \|\alpha_n\|_2. \qquad (15)$$
This penalized least squares optimization can be separated into $N$ subproblems of dimension $Q$. In turn one can show that the solution to each $Q$-variate subproblem is a vector collinear with $Y_n$, but thresholded towards zero. Indeed the solution to (15) is
$$\hat\alpha_{n,\lambda} = \left(1 - \frac{\lambda}{\|Y_n\|_2}\right)_+ Y_n, \quad n = 1, \ldots, N, \qquad (16)$$
which is reminiscent of the James-Stein estimator (5), and different from Cai's estimator (12) in that $Q$ is fixed and does not increase with $N$. Interestingly, the 2-norm in (16) is not squared as in (5). To bridge between the two estimators and extend them, we propose first the generalized James-Stein (GJS) thresholding estimator
$$\hat\alpha_{n,\lambda,\nu} = \eta^{\mathrm{GJS}}_{\lambda,\nu}(Y_n) := \left(1 - \frac{\lambda^\nu}{\|Y_n\|_2^\nu}\right)_+ Y_n, \quad \lambda \ge 0,\ \nu > 0,\ Y_n \in \mathbb{R}^Q, \qquad (17)$$
which encompasses (16) for $\nu = 1$ and (5) for $\nu = 2$. Expressed as penalized least squares, the GJS estimator is the solution to
$$\min_{\alpha_n} \ \frac12 \|Y_n - \alpha_n\|_2^2 + \frac{\lambda^\nu}{\|Y_n\|_2^{\nu-1}}\, \|\alpha_n\|_2, \qquad (18)$$

which can also be seen as a group version of the adaptive lasso. With this parametrization, $\lambda$ is always the threshold value regardless of $\nu$. Note that $\nu = 1$ corresponds to soft thresholding (8), $\nu \to \infty$ corresponds to hard thresholding (7), and $0 < \nu < 1$ gives a thresholding function that biases the estimate even more than soft thresholding. Since GJS thresholding is almost differentiable, we consider the Stein unbiased risk estimate (Stein, 1981) to select both hyperparameters in the next section.

The univariate thresholding function ($Q = 1$) of the generalized James-Stein estimator is plotted in the top row of Figure 2 for $\nu = 10$ and $\lambda = 3$. The choice $\nu = 10$ makes the thresholding function close to hard thresholding. A property that proves important in Section 4.2.1 is that the thresholding function, although almost differentiable, is forced to have a large right derivative $\eta'_{\lambda,\nu}(\lambda^+) = \nu$ at the threshold $\lambda$ for large $\nu$, since the thresholding function converges to hard thresholding when $\nu \to \infty$. This motivates the smooth James-Stein (SJS) estimator
$$\hat\alpha_{n,\lambda,\nu;\gamma} = \eta^{\mathrm{SJS}}_{\lambda,\nu;\gamma}(Y_n) := \left(1 - \frac{\lambda^\nu}{\|Y_n\|_2^\nu}\right)_+^\gamma Y_n, \quad \lambda \ge 0,\ \nu > 0,\ \gamma \ge 1,\ Y_n \in \mathbb{R}^Q, \qquad (19)$$
which uses a thresholding function with smooth higher derivatives, for reasons we now explain.
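A minimal sketch of the block thresholding rule (19), which recovers (17) at $\gamma = 1$ and (16) at $\nu = 1$, $\gamma = 1$; the function name is ours and the numerical check is only illustrative.

```python
import numpy as np

def eta_sjs(y, lam, nu, gamma):
    """Smooth James-Stein thresholding (19) of a block y in R^Q.
    gamma = 1 gives the generalized James-Stein estimator (17);
    nu = 1, gamma = 1 gives block soft thresholding (16)."""
    norm = np.linalg.norm(y)
    if norm <= lam:
        return np.zeros_like(y)
    w = 1.0 - (lam / norm) ** nu        # lies in (0, 1) above the threshold
    return (w ** gamma) * y

y = np.array([2.0, -1.0, 0.5])
print(eta_sjs(y, lam=3.0, nu=10, gamma=1))                        # zero: ||y||_2 < 3
print(eta_sjs(np.array([4.0, 1.0, 0.0]), 3.0, 10, 2 * np.log(10) + 1))
```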

4.2 Hyperparameters selection

The added flexibility translates into better estimation of $\alpha$ provided both $\lambda$ and $\nu$ can be well selected from the data. To that aim we propose two selection rules. The first one minimizes an unbiased estimate of the risk and the second one minimizes an information criterion. The second one is preferable in a situation of strong sparsity, and the first one when the underlying vector is more dense.

4.2.1 Stein unbiased risk estimate

For the multivariate canonical model (14), we first consider the generalized James-Stein estimator (17). Its thresholding function is almost differentiable in Stein's sense (Stein, 1981), and one finds that the Stein unbiased risk estimate is
$$\mathrm{SURE}(\lambda, \nu) = \sum_{n=1}^N \|Y_n\|_2^2\, 1(\|Y_n\|_2 \le \lambda) + \sum_{n=1}^N \frac{\lambda^{2\nu}}{\|Y_n\|_2^{2\nu-2}}\, 1(\|Y_n\|_2 > \lambda) + 2 \sum_{n=1}^N \left(Q - (Q - \nu)\frac{\lambda^\nu}{\|Y_n\|_2^\nu}\right) 1(\|Y_n\|_2 > \lambda) - NQ. \qquad (20)$$
This bivariate function can be minimized over $\lambda$ and $\nu$ to select both hyperparameters. Section 4.5 investigates the finite sample performance of this selection with a Monte-Carlo simulation. It turns out not to be as good as the extension of SURE for the non-almost-differentiable thresholding obtained by $\ell_\nu$ penalized likelihood (Sardy, 2009). This drawback of SURE applied to the generalized James-Stein estimator can be alleviated in the following way.

The SURE function (20) is discontinuous in $\lambda$ at all $\|Y_n\|_2$, $n = 1, \ldots, N$. Consequently the unbiased risk estimate is erratic in $\lambda$ (i.e., has high variance) and is hard to minimize. This is particularly true when $\nu$ is large compared to $Q$, because the jumps at the discontinuity points are then large, as we can see in the second line of (20). This is due to the fact that generalized James-Stein thresholding is discontinuous in its first derivative, and that the larger $\nu$, the steeper the slope at the threshold value to approximate the hard thresholding function. To overcome this drawback, we propose the smooth James-Stein estimator (19), which not only thresholds but also has smoother higher derivatives for $\gamma > 1$, and

corresponds to (17) for $\gamma = 1$. The corresponding Stein unbiased risk estimate,
$$\mathrm{SURE}(\lambda, \nu; \gamma) = \sum_{n=1}^N \|Y_n\|_2^2\, 1(\|Y_n\|_2 \le \lambda) + 2Q \sum_{n=1}^N \left\{1 - \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu\right\}_+^\gamma + \sum_{n=1}^N \left(\left\{1 - \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu\right\}_+^\gamma - 1\right)^2 \|Y_n\|_2^2\, 1(\|Y_n\|_2 > \lambda) + 2\gamma\nu \sum_{n=1}^N \left\{1 - \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu\right\}_+^{\gamma-1} \left[\frac{\lambda}{\|Y_n\|_2}\right]^\nu 1(\|Y_n\|_2 > \lambda) - NQ, \qquad (21)$$
is consequently less erratic. Figure 2 illustrates the advantage of using a large $\gamma$ when estimating the risk with SURE for a large $\nu$. In practice, $\gamma$ could be estimated from the data by minimizing a three-dimensional SURE function. We adopt a more pragmatic approach: we set $\gamma = 2\log\nu + 1$, so that the thresholding function departs from the hard thresholding function and increases its smoothness only slightly as $\nu$ gets larger (thanks to the slowly increasing logarithmic function of $\nu$), and so that for $\nu = 1$ we fall back on the soft thresholding function (Donoho and Johnstone, 1995), since $\gamma = 1$ in that case. Section 4.5 shows that this choice leads to better risk estimation with (21) and $\gamma = 2\log\nu + 1$ than with (20) (i.e., $\gamma = 1$). This property implies better selection of the two hyperparameters, and hence provides the smooth James-Stein estimator with better $\ell_2$ risk, as the Monte-Carlo simulation shows in Section 4.5.

Model (14) and the SURE formulas (20) and (21) assume unit variance for the noise. In practice, one can estimate the variance and rescale the responses to have approximate unit variance. To avoid estimating the variance, one can observe that the Stein formulas (20) and (21) can be written in the form $\mathrm{RSS} + 2\sigma^2\,\mathrm{d.f.} + c$, where d.f. stands for degrees of freedom. Like generalized cross validation (Golub et al., 1979), one can define a generalized SURE as
$$\mathrm{GSURE}(\lambda, \nu; \gamma) = \frac{\mathrm{RSS}(\lambda, \nu; \gamma)/N}{(1 - \mathrm{d.f.}(\lambda, \nu; \gamma)/N)^2},$$
which is a first order approximation to the exact SURE formula and which estimates the variance by $\hat\sigma^2 = \mathrm{RSS}(\lambda, \nu; \gamma)/N$. In practice SURE can be minimized on a fine grid thanks to the rapid calculation of the estimate (19) for any given pair $(\lambda, \nu)$.

Figure 2: Thresholding functions and corresponding Stein unbiased risk estimation for $Q = 1$ and $\nu = 10$ on simulated data. Left: Subbotin $\ell_\nu$ penalized least squares ($\nu = 0.4$); middle: generalized James-Stein ($\gamma = 1$); right: smooth James-Stein with smoothness parameter $\gamma = 2\log\nu + 1$. Top: thresholding functions with parameters chosen to approximate hard thresholding ($\lambda = 3$); the left one is discontinuous at the threshold but with a small slope at the threshold, the middle one has a sharp change of the left and right derivatives at the threshold, and the right one has a smooth change of derivative at the threshold. Bottom: corresponding Stein unbiased risk estimate and true loss. We observe on the same data that the right estimation is less erratic than the middle one.
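A sketch of how (21) can be evaluated and minimized on a grid, assuming unit-variance blocks stored as the rows of an $N \times Q$ array; names, the grid and the simulated data are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def sure_sjs(Y, lam, nu, gamma):
    """Stein unbiased risk estimate (21) for smooth James-Stein thresholding of
    N blocks of dimension Q with unit-variance Gaussian noise.
    Y has shape (N, Q); gamma = 1 recovers (20)."""
    N, Q = Y.shape
    norms = np.linalg.norm(Y, axis=1)
    small = norms <= lam
    w = np.where(small, 0.0, 1.0 - (lam / norms) ** nu)   # positive part of 1 - (lam/||Y_n||)^nu
    rss = np.sum(np.where(small, norms ** 2, (1.0 - w ** gamma) ** 2 * norms ** 2))
    div = np.sum(Q * w ** gamma
                 + gamma * nu * np.where(small, 0.0, w ** (gamma - 1) * (lam / norms) ** nu))
    return rss + 2.0 * div - N * Q

# select (lam, nu) on a grid with gamma = 2*log(nu) + 1, as suggested in the text
rng = np.random.default_rng(1)
Y = rng.standard_normal((1000, 3)); Y[:20] += 5.0
grid = [(lam, nu) for lam in np.linspace(1.0, 6.0, 26) for nu in (1, 2, 4, 10)]
lam_hat, nu_hat = min(grid, key=lambda p: sure_sjs(Y, p[0], p[1], 2 * np.log(p[1]) + 1))
print(lam_hat, nu_hat)
```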

4.2.2 Universal threshold and information criterion

We now derive a universal threshold and an information criterion for the generalized James-Stein estimator. The idea is to approximate the distribution of the smallest threshold $\lambda_Y$ that, for a sample $Y = (Y_1, \ldots, Y_N)$ of size $N$, sets to zero all $N$ blocks of length $Q$ when the true underlying model is made of zero vectors. Controlling the maximum of $\lambda_Y$ then leads to a finite sample universal threshold $\bar\lambda_{N,Q}$, an asymptotic universal threshold $\lambda_{N,Q}$, and a prior $\pi_\lambda$, which are used to define an information criterion in the spirit of the sparsity $\ell_\nu$ information criterion SL$_\nu$IC (Sardy, 2009), as we now show.

With the assumption that $Y_n \sim \mathrm{N}_Q(0, I_Q)$ for $n = 1, \ldots, N$, we seek the smallest threshold $\lambda_{N,Q}$ such that the generalized James-Stein estimator (17) estimates the right model with a probability tending to one:
$$P(\hat\alpha_{1,\lambda,\nu} = 0, \ldots, \hat\alpha_{N,\lambda,\nu} = 0) = P(\|Y_1\|_2^2 \le \lambda_{N,Q}^2, \ldots, \|Y_N\|_2^2 \le \lambda_{N,Q}^2) \rightarrow_N 1. \qquad (22)$$
The distribution of $M_N = \max_{n=1,\ldots,N} \|Y_n\|_2^2$, where the $\|Y_n\|_2^2$ are i.i.d. $\chi^2_Q = \Gamma(Q/2, 1/2)$, is degenerate. Extreme value theory provides a proper rescaling of $M_N$, namely $c_N^{-1}(M_N - d_N(Q)) \rightarrow_d G_0$, where $G_0(x) = \exp(-\exp(-x))$ is the Gumbel distribution, $c_N = 2$, and $d_N(Q)$ is the root in $\xi$ of
$$\log N - \log\Gamma(Q/2) = (1 - Q/2)\log(\xi/2) + \xi/2. \qquad (23)$$
The normalizing constant $2(\log N + (Q/2 - 1)\log\log N - \log\Gamma(Q/2))$ given by Embrechts et al. (1997, p. 156) for the Gamma distribution is an asymptotic approximation to the root of (23), and leads to a good Gumbel approximation

when $N$ is large compared to $\Gamma(Q/2)$. In that case we define the asymptotic universal threshold
$$\lambda_{N,Q} = \sqrt{2\left(\log N + (Q/2)\log\log N - \log\Gamma(Q/2)\right)}, \qquad (24)$$
for which the asymptotic property (22) is satisfied since
$$P(\|Y_1\|_2^2 \le \lambda_{N,Q}^2, \ldots, \|Y_N\|_2^2 \le \lambda_{N,Q}^2) \approx G_0(\log\log N) = \exp(-1/\log N) \ge 1 - 1/\log N \rightarrow_N 1.$$
Note that for $Q = 1$ we get back the standard universal threshold $\sqrt{2\log N}$ up to a small term, and for $Q = 2$ we get back the universal threshold of Sardy (2000) for denoising complex-valued wavelet coefficients. When $Q$ gets large, however, the proposed normalizing constant is too far from the exact root to provide a useful approximation, so we find the root $d_N(Q)$ of (23) numerically, which exists provided $N$ is large enough (e.g., $\log N > \log\Gamma(Q/2) + (Q/2 - 1)(1 - \log(Q/2 - 1))$). The finite sample universal threshold is then defined as
$$\bar\lambda_{N,Q} = \sqrt{d_N(Q) + c_N \log\log N} \qquad (25)$$
$$\text{with } d_N(Q) \text{ the root of (23)}, \qquad (26)$$
so as to have the same rate of convergence for all $Q$ as with $\lambda_{N,Q}$ in place of $\bar\lambda_{N,Q}$ in (25).

Property 1. The inequality $\bar\lambda^2_{N,Q} \ge \lambda^2_{N,Q}$ holds for all $Q \ge 2$ and all $N \ge N_0(Q) = \exp(\Gamma(Q/2)^{1/(Q/2-1)})$, and $\bar\lambda^2_{N,Q} \sim \lambda^2_{N,Q}$ as $N \to \infty$.

Proof: Studying (23) as a function of $\xi$, one sees that the root for a finite sample is larger than the asymptotic root given by Embrechts et al. (1997, p. 156) provided $(Q/2 - 1)\log\log N - \log\Gamma(Q/2) \ge 0$.
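A sketch of how the finite sample universal threshold (25)-(26) can be computed by solving (23) numerically; the bracketing interval passed to the root finder is our choice and assumes $N$ is large enough for the root to exist.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln

def universal_threshold(N, Q):
    """Finite sample universal threshold (25)-(26): solve (23) numerically for d_N(Q)
    and add the Gumbel rescaling term c_N * log(log N) with c_N = 2."""
    def f(xi):  # root equation (23), written as f(xi) = 0
        return (1 - Q / 2) * np.log(xi / 2) + xi / 2 - (np.log(N) - gammaln(Q / 2))
    # bracket the larger root: start just above the minimizer xi = 2(Q/2 - 1)
    root = brentq(f, 2 * max(Q / 2 - 1, 1e-8) + 1e-6, 10 * (np.log(N) + Q))
    return np.sqrt(root + 2 * np.log(np.log(N)))

print(universal_threshold(N=4096, Q=3))
print(np.sqrt(2 * np.log(4096)))     # Q = 1 reference: standard universal threshold
```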

More than a bound, the asymptotic Gumbel pivot for $M_N$ leads to a prior distribution $F_\lambda(\lambda) = G_0((\lambda^2 - d_N(Q))/2)$ for the threshold $\lambda$ that reconstructs true zero vectors from noisy measurements. Bayes theorem provides the joint posterior distribution of the coefficients and the hyperparameters. Taking its negative logarithm leads to the following information criterion, in the spirit of the sparsity $\ell_\nu$ information criterion (Sardy, 2009).

Definition. Suppose model (14), or model (3.1) of Cai (1999) for block thresholding in wavelet smoothing, holds. The sparsity weighted $\ell_2$ information criterion for estimation of $(\alpha_1, \ldots, \alpha_N)$ and selection of $(\lambda, \nu)$ of the generalized James-Stein estimator (18) is
$$\mathrm{SL}^{\mathrm{w}}_2\mathrm{IC}(\alpha_1, \ldots, \alpha_N, \lambda, \nu) = \frac12 \sum_{n=1}^N \|Y_n - \alpha_n\|_2^2 + \sum_{n=1}^N \frac{\lambda^\nu}{\|Y_n\|_2^{\nu-1}}\, \|\alpha_n\|_2 - N\log\left(\frac{\Gamma(Q/2)}{2\pi^{Q/2}\Gamma(Q)}\right) - QN\nu\log\lambda + Q(\nu - 1)\sum_{n=1}^N \log\|Y_n\|_2 - \log\pi_\lambda(\lambda; \tau_{N,Q}) - \log\pi_\nu(\nu), \qquad (27)$$
where $\pi_\nu$ is a prior for $\nu$ (that we choose Uniform, possibly improper, on the set $\Omega_\nu \subset (0, \infty)$ where thresholding occurs), $\pi_\lambda(\lambda; \tau) = F'_\lambda(\lambda; \tau)$ with $F_\lambda(\lambda; \tau) = G_0((\lambda^2/\tau^2 - d_N(Q))/2)$, and $\tau$ is calibrated to $\tau^2_{N,Q} = \lambda^2_{N,Q}/(QN\nu + 1)$ to match asymptotic model consistency when the true sequences are null, i.e., when $\alpha_n = 0$, $n = 1, \ldots, N$. In practice, one minimizes SL$^{\mathrm{w}}_2$IC like AIC or BIC to select the hyperparameters $(\lambda, \nu)$ and estimate the sequences $\alpha_n$, $n = 1, \ldots, N$. Deriving an information criterion for the smooth James-Stein estimator (i.e., $\gamma > 1$) requires knowing the equivalent optimization problem it solves, which is an open problem.
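Under the form of (27) given above, the criterion can be evaluated directly; the sketch below is ours, assumes a uniform prior for $\nu$ (a constant, hence dropped) and the Gumbel prior for $\lambda$ stated in the Definition, and uses hypothetical argument names.

```python
import numpy as np
from scipy.special import gammaln

def sl2w_ic(alpha, Y, lam, nu, dNQ, tau2):
    """Evaluate the SL^w_2 IC (27), up to the constant -log pi_nu of a uniform prior on nu.
    alpha and Y have shape (N, Q); dNQ is the root of (23) and tau2 the calibrated tau^2."""
    N, Q = Y.shape
    norms = np.linalg.norm(Y, axis=1)
    fit = 0.5 * np.sum((Y - alpha) ** 2)
    pen = np.sum(lam ** nu / norms ** (nu - 1) * np.linalg.norm(alpha, axis=1))
    const = -N * (gammaln(Q / 2) - np.log(2.0) - (Q / 2) * np.log(np.pi) - gammaln(Q))
    scale = -Q * N * nu * np.log(lam) + Q * (nu - 1) * np.sum(np.log(norms))
    u = (lam ** 2 / tau2 - dNQ) / 2.0
    log_prior_lam = -np.exp(-u) - u + np.log(lam / tau2)   # log of d/dlam G0((lam^2/tau^2 - dNQ)/2)
    return fit + pen + const + scale - log_prior_lam
```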

4.3 Application to wave burst estimation and detection

We illustrate how smooth James-Stein thresholding with SURE can be employed blockwise and levelwise for wavelet smoothing of $Q = 3$ concomitant time series recorded to detect gravitational wave bursts, as described in Section 2. To that aim we consider three real series of length $T = 2^{14}$ (about 3.27 seconds of recording). Taking $j_0 = 4$ in the wavelet expansion (2), we select 22 hyperparameters with SURE (11 levels with two hyperparameters each). We then add a so-called injection at time $t = 1500$, which represents an artificial wave burst. The injection is three times larger on the second captor than on the first, and five times smaller on the third captor, as plotted in dotted lines in Figure 4. Figure 3 shows the data (first line) recorded on each captor (columnwise), the estimate by blockwise smooth James-Stein smoothing (second line), and the estimate obtained by employing a univariate smoothing per captor (third line). We observe that, as opposed to univariate smoothing, blockwise smoothing detects the injection and almost only the injection (except right before time $t = $ ), whereas the other has many false detections. Zooming in around the injection, Figure 4 shows that the estimation of the injection is also better with blockwise smoothing. Section 4.5 compares several estimators in a Monte-Carlo simulation.

4.4 Oracle inequality

The $\ell_2$ risk for model (14) is
$$R(\hat\alpha, \alpha) = \sum_{n=1}^N \rho_n(\alpha_n) = \sum_{n=1}^N E\|\hat\alpha_n - \alpha_n\|_2^2.$$

Figure 3: A wave burst injection at time $t = 1500$ added to three real captor noise series (columnwise). Top: time series of length $T = 2^{14}$. Middle: estimation with the smooth James-Stein estimator, levelwise and blockwise, with hyperparameters selected by SURE. Bottom: estimation by applying a univariate smoothing per captor, levelwise, with hyperparameters selected by SURE.

Following Donoho and Johnstone (1994), the oracle predictive performance of the block diagonal projection estimator $\hat\alpha_n = \delta_n Y_n$, where $\delta_n \in \{0, 1\}$, is
$$\rho_n(\delta_n, \alpha_n) = \begin{cases} \|\alpha_n\|_2^2, & \text{if } \delta_n = 0, \\ Q, & \text{if } \delta_n = 1. \end{cases}$$

Figure 4: Zoom on the injection, represented by a dotted line. The estimates are plotted with a continuous line: blockwise smoothing (first line) and univariate smoothing per captor (second line).

Hence the oracle hyperparameters are $\delta_n = 1_{\{\|\alpha_n\|_2^2 > Q\}}$ for $n = 1, \ldots, N$, and the corresponding oracle overall risk for this block-thresholding estimator is
$$R^*(\delta, \alpha) = \sum_{n=1}^N \min(\|\alpha_n\|_2^2, Q).$$
The following theorem extends the oracle inequality obtained by Donoho and Johnstone (1994) for $Q = 1$ to block thresholding for $Q \ge 2$.

Theorem 1: Consider any fixed $Q \ge 2$. Then there exists a sample size $N_0(Q)$ such that, for all $N \ge N_0(Q)$ and with the universal threshold $\bar\lambda_{N,Q}$ defined

in (26), the smooth James-Stein estimator (19) with $\nu = 1$ and $\gamma = 1$ achieves the oracle inequality
$$R(\hat\alpha^{\mathrm{SJS}}_{\bar\lambda_{N,Q},1;1}, \alpha) \le (Q + \lambda^2_{N,Q})(Q + R^*(\delta, \alpha)),$$
where $\lambda^2_{N,Q} = 2\log N + Q\log\log N - 2\log\Gamma(Q/2)$.

Proof: See Appendix A.

This result shows that we can mimic the overall oracle risk achieved with $N$ oracle hyperparameters within a factor of essentially $2\log N + Q\log\log N - 2\log\Gamma(Q/2)$ with a block thresholding estimator driven by a single hyperparameter. The smallest sample size $N_0(Q)$ for which the inequality holds is quite small in practice; more work is needed to get a tight expression.

4.5 Monte-Carlo simulation

We reproduce the Monte-Carlo simulation of Johnstone and Silverman (2004) for $Q = 1$ captor to estimate a sparse sequence of length $N = 1000$ and of varying degrees of sparsity, as measured by the number of nonzero terms, taken in $\{5, 50, 500\}$, and by the value $\mu$ of the nonzero terms, taken in $\{3, 4, 5, 7\}$. Table 1 reports estimated risks of four estimators: smooth James-Stein, generalized James-Stein, Subbotin $\ell_\nu$ penalized likelihood (Sardy, 2009) and EBayesThresh (Johnstone and Silverman, 2004). The results clearly show the superiority of the smooth James-Stein estimator with SURE. Indeed the smooth estimator is always at least as good as the generalized James-Stein estimate, which proves excellent selection of the hyperparameters thanks to the smoothness of the estimate and of the SURE function when $\gamma > 1$. In the case of extreme sparsity, only generalized James-Stein using the information criterion and EBayesThresh perform better. Note that this drawback

of SURE has been explained by Donoho and Johnstone (1995, Section 2.4).

Table 1: Monte-Carlo simulation for a sequence of length $N = 1000$: average total squared loss of the smooth James-Stein (SJS) estimator with hyperparameters selected by SURE, the generalized James-Stein (GJS) estimator with hyperparameters selected by SURE and by SL$^{\mathrm{w}}_2$IC, the Subbotin$(\lambda, \nu)$ posterior mode estimator with hyperparameters selected by SURE and by SL$_\nu$IC, and the EBayesThresh estimator with Cauchy-like prior. Columns correspond to the number of nonzero entries (5, 50 or 500) and their value $\mu \in \{3, 4, 5, 7\}$; the first block of rows is for $Q = 1$ captor (SJS, GJS, Subbotin, EBayesThresh) and the second block for $Q = 3$ captors (SJS, GJS).

We consider a similar Monte-Carlo simulation with now $Q = 3$ concomitantly observed sequences: all three underlying sequences are identical in the location of the nonzero entries, but not in their amplitude. Since three sequences carry more information than a single one, we may hope to distinguish noise from signal with a smaller signal-to-noise ratio. Hence we consider that the values of the nonzero terms are fixed to $\mu_1 = 1$, $\mu_2 = 2$ and $\mu_3 = \mu$ taken in $\{3, 4, 5, 7\}$. The estimated risk of the smooth James-Stein estimator is reported in Table 1. To allow some comparison with $Q = 1$, the estimated overall risk of the $Q \times N$ sequences is divided by $Q$. The smooth James-Stein estimator again performs best overall, while the SL$^{\mathrm{w}}_2$IC selection rule performs better than for a single observed sequence, and the improvement from $Q = 1$ to $Q = 3$ is larger the more sparse the sequence.

5 Smooth adaptive lasso

We now consider the classical parametric regression setting
$$Y = \mu + \epsilon \quad\text{with}\quad \mu = \alpha_0 \mathbf{1} + X\alpha \quad\text{and}\quad \epsilon \ \text{i.i.d.}\ \mathrm{N}(0, 1), \qquad (28)$$
where $X$ is any $N \times P$ regression matrix with a single response vector (i.e., $Q = 1$). Our goal here is to employ the smooth James-Stein thresholding function, leading to a smooth version of the adaptive lasso. We use ideas from optimization to implicitly define the smooth James-Stein estimate as a fixed point, and we derive an unbiased estimate of its risk using the implicit function theorem to obtain the Jacobian of the estimate with respect to the data. The additional smoothness of the estimator leads to a less erratic unbiased risk estimate that is easier to minimize as a function of its hyperparameters, as we observed in the previous section.

5.1 Definition

When doing variable selection, we believe only a subset of the $P$ covariates forms the true linear association, or equivalently that some entries $\alpha_p$ of the coefficient vector $\alpha$ in (28) are zero. To estimate a sparse vector, one can solve the penalized least squares problem
$$\min_{\beta \in \mathbb{R}^P} \ \frac12 \|Y - X_\Sigma\beta\|_2^2 + \lambda^\nu \sum_{p=1}^P \frac{|\beta_p|}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}} \qquad (29)$$
for a given regularization parameter $\lambda$. Note that in (29) the matrix $X_\Sigma$ means that the covariates have been mean-centered (to avoid penalizing the intercept) and rescaled, i.e., $X_\Sigma = (I - \frac1N J)XD_\Sigma$, where $J$ is the $N \times N$ matrix of ones and $D_\Sigma$ is a diagonal matrix that properly rescales the covariates; see Sardy (2008)

for a precise definition of the $\Sigma$-rescaling. The estimator $\hat\beta_\lambda$ solving (29) is called lasso (Tibshirani, 1996) when $\nu = 1$ and adaptive lasso (Zou, 2006) when $\nu > 1$. The coefficient estimates of the original problem are then $\hat\alpha_\lambda = D_\Sigma\hat\beta_\lambda$ and $\hat\alpha_0 = \bar y - \frac1N \mathbf{1}^T X\hat\alpha_\lambda$. The adaptive lasso enjoys an oracle property in the sense that it can guess the true underlying model with a probability tending to one and behaves essentially like the least squares estimate on the correct model. The adaptive lasso is efficiently calculated thanks to the lars algorithm (Efron et al., 2004). In practice a stable selection of $\nu$ remains an open problem, to which we provide a solution below.

We saw in the previous section (i.e., for $X = I$) both that the parameter $\nu$ can be selected by minimizing the Stein unbiased risk estimate and that its smooth James-Stein extension enjoys better predictive performance. Our goal here is to generalize both ideas to the regression setting (28).

Relaxation, also called backfitting, is another possible algorithm to solve (29) (see Fu (1998); Sardy et al. (2000)). It allows one to implicitly define the estimate as a fixed point of the algorithm in the following way. Starting with any initial guess for $\beta$, the relaxation algorithm keeps all entries constant except the $p$th entry, which is updated to
$$\beta_p^{\mathrm{new}} = \frac{1}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, \|x_p\|_2^2}\, \eta^{\mathrm{soft}}_{\lambda^\nu}\!\left(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p\right),$$
the solution to a convex univariate optimization subproblem, where $r_p = Y - X_{-p}\beta^{\mathrm{old}}$ are the residuals without the $p$th term, and $X_{-p}$ is the rescaled matrix $X_\Sigma$ with its $p$th column $x_p$ replaced by a column of zeros. At the limit, a fixed point defines the solution, which solves the system of nonlinear equations
$$\hat\beta_p(Y) = \frac{1}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, \|x_p\|_2^2}\, \eta^{\mathrm{soft}}_{\lambda^\nu}\!\left(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p\right)$$

with $r_p = Y - X_{-p}\hat\beta(Y)$, $p = 1, \ldots, P$. Instead of employing the almost differentiable soft thresholding, we iteratively employ the differentiable smooth James-Stein thresholding according to
$$\beta_p^{\mathrm{new}} = \frac{r_p^T x_p}{\|x_p\|_2^2} \left[\frac{\eta^{\mathrm{soft}}_{\lambda^\nu}(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p)}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p}\right]^\gamma \quad\text{with } r_p = Y - X_{-p}\beta^{\mathrm{old}}.$$
The fixed point
$$\hat\beta_p(Y) = \frac{r_p^T x_p}{\|x_p\|_2^2} \left[\frac{\eta^{\mathrm{soft}}_{\lambda^\nu}(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p)}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p}\right]^\gamma \quad\text{with } r_p = Y - X_{-p}\hat\beta(Y), \quad p = 1, \ldots, P, \qquad (30)$$
implicitly defines the smooth James-Stein estimate, called the smooth adaptive lasso estimate. One checks that when $X = I$ (in which case $\hat\beta_p^{\mathrm{LS}} = y_p$ for $p = 1, \ldots, N$), then (30) is the smooth James-Stein thresholding previously defined in (19). The cases $\gamma = 0$ and $\gamma = 1$ are known to converge to a fixed point, since they correspond to solving least squares and adaptive lasso, respectively. We have always observed convergence to a unique estimate empirically for a whole range of values of $\gamma$, so we conjecture convergence to a unique fixed point for all $\gamma \ge 0$. The assumption that the columns of $X$ are linearly independent is sufficient to guarantee a unique (adaptive) lasso estimate, but can be relaxed under certain conditions (Sardy, 2009, Theorem 3).
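A sketch of the backfitting iteration behind (30), assuming the matrix passed in is already the centered and rescaled $X_\Sigma$ and that a fixed number of sweeps is enough for convergence (which the text only conjectures for general $\gamma$); names are ours.

```python
import numpy as np

def eta_soft(x, phi):
    return np.sign(x) * np.maximum(np.abs(x) - phi, 0.0)

def smooth_adaptive_lasso(X, Y, lam, nu, gamma, n_iter=200):
    """Backfitting sketch of the fixed point (30); gamma = 1 recovers the adaptive lasso
    update. X plays the role of the centered/rescaled matrix X_Sigma."""
    N, P = X.shape
    beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]
    c = np.abs(beta_ls) ** (nu - 1)               # adaptive weights |beta_p^LS|^(nu-1)
    beta = np.zeros(P)
    for _ in range(n_iter):
        for p in range(P):
            r_p = Y - X @ beta + X[:, p] * beta[p]    # residuals without the p-th term
            s = r_p @ X[:, p]
            t = c[p] * s
            if t == 0.0:
                beta[p] = 0.0
                continue
            beta[p] = (s / (X[:, p] @ X[:, p])) * (eta_soft(t, lam ** nu) / t) ** gamma
    return beta
```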

5.2 Unbiased risk estimate

Provided the estimator $\hat\mu = g(Y) + Y$, with $g(Y) = \bar y\mathbf{1} + X_\Sigma\hat\beta(Y) - Y$ and $\hat\beta(Y)$ defined in (30), is almost differentiable with respect to the data, we can employ the Stein unbiased risk estimate to select $(\lambda, \nu)$. This requires the partial derivatives
$$\partial g_n(Y)/\partial Y_n = x_n^T \nabla_n\hat\beta(Y) - 1 + 1/N, \quad n = 1, \ldots, N, \qquad (31)$$
where $\nabla_n\hat\beta(Y)$ is the derivative of $\hat\beta(Y)$ with respect to $Y_n$ and $x_1, \ldots, x_N$ are the rows of $X_\Sigma$. The following theorem shows how this can be done.

Theorem 2: Let $\hat\beta^{\mathrm{LS}} = AY$ where $A = (X_\Sigma^T X_\Sigma)^- X_\Sigma^T = (a_{pn})$ (using the generalized inverse if necessary). For a given $(\lambda > 0, \nu \ge 1)$, the smooth adaptive lasso estimate $\hat\beta(Y)$ in (30) is continuously differentiable for $\gamma > 1$. Moreover let $I_0 = \{p \in \{1, \ldots, P\}: \hat\beta_p(Y) = 0\} = \{p_i\}_{i=1,\ldots,|I_0|}$ of cardinality $|I_0|$, let $\bar I_0$ be its complement, and let $X_\Sigma^{\bar I_0}$ be the columns of $X_\Sigma$ with an index in $\bar I_0$. For a given $n$, the non-zero elements $h_n^{\bar I_0}$ of $\nabla_n\hat\beta(Y)$ are the solution to the following system of $|\bar I_0|$ linear equations with as many unknowns:
$$x_p^T X_\Sigma^{\bar I_0} D_p^{\bar I_0} h_n^{\bar I_0} = z_{np}, \quad p = p_i, \quad i = 1, \ldots, |\bar I_0|, \qquad (32)$$
where
$$z_{np} = x_{np} + u_p v_p, \qquad (33)$$
and $D_p^{\bar I_0}$ is the identity matrix except that its $i$th diagonal element is
$$D_{p,ii}^{\bar I_0} = \frac{w_p^\gamma}{1 - \gamma + \gamma w_p} =: v_p^{-1}, \qquad (34)$$
with
$$w_p = \frac{\eta^{\mathrm{soft}}_{\lambda^\nu}(|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p)}{|\hat\beta_p^{\mathrm{LS}}|^{\nu-1}\, r_p^T x_p} \quad\text{and}\quad u_p = \frac{\gamma(\nu - 1)\, a_{pn}\, (w_p - 1)\, r_p^T x_p}{\hat\beta_p^{\mathrm{LS}}\, w_p^\gamma}. \qquad (35)$$

Solving (32) leads to the gradient $\nabla_n\hat\beta(Y)$ needed in (31) to calculate $\mathrm{SURE}(\lambda, \nu; \gamma)$ for all $\lambda$, $\nu$ and $\gamma$.

Proof: See Appendix B.

Note that we used the least squares estimate as the raw estimate of the coefficients, but the formula would still hold for other linear estimators. In the limit when $\gamma$ tends to one, the adaptive lasso has $D_{p,ii}^{\bar I_0} = 1$ for all $i$ in (34). A particular case of interest is the lasso (i.e., $\nu = 1$), for which $z_{np} = x_{np}$ in (33), so that (32) leads to $h_n^{\bar I_0} = ((X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0})^{-1}(x_n^{\bar I_0})^T$. Hence, from (31), we see that
$$\sum_{n=1}^N \partial g_n(Y)/\partial Y_n = -N + \sum_{n=1}^N x_n^T \nabla_n\hat\beta(Y) = -N + \sum_{n=1}^N x_n^{\bar I_0} ((X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0})^{-1} (x_n^{\bar I_0})^T = -N + \mathrm{trace}\left(((X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0})^{-1} (X_\Sigma^{\bar I_0})^T X_\Sigma^{\bar I_0}\right) = -N + |\bar I_0|,$$
where $|\bar I_0|$ is lasso's degrees of freedom previously found by Zou et al. (2007). In practice we first minimize SURE over $(\lambda, \nu)$ for $\gamma = 1$ (i.e., adaptive lasso), which we do efficiently on a fine grid thanks to lars (Efron et al., 2004). This provides a neighborhood for $\lambda$, namely $[\lambda/10, 10\lambda]$, over which we then search on a grid for the minimizer of SURE for the smooth adaptive lasso ($\gamma > 1$). This strategy provides a good and efficient selection of the pair of hyperparameters.
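As a sanity check on the degrees of freedom entering SURE, the divergence can also be approximated by finite differences instead of solving the linear systems (32); the wrapper `fit_mu` below is a hypothetical helper returning the fitted values for fixed hyperparameters.

```python
import numpy as np

def divergence_fd(fit_mu, Y, eps=1e-5):
    """Finite-difference estimate of sum_n d mu_hat_n / d Y_n, i.e. the equivalent
    degrees of freedom entering SURE; fit_mu(Y) is any fitted-values map, e.g. a
    wrapper around the smooth adaptive lasso fit for fixed (lam, nu, gamma)."""
    mu0 = fit_mu(Y)
    div = 0.0
    for n in range(Y.size):
        Yp = Y.copy()
        Yp[n] += eps
        div += (fit_mu(Yp)[n] - mu0[n]) / eps
    return div
```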

5.3 Monte-Carlo simulation

We replicate the Monte-Carlo simulation of Zou (2006) with $P = 8$ covariates whose coefficients are either sparse ($\alpha = (3, 1.5, 0, 0, 2, 0, 0, 0)$ for Model 1, with $N \in \{20, 60\}$) or not sparse ($\alpha = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)$ for Model 2, with $N \in \{40, 80\}$). The covariates $x_n$ ($n = 1, \ldots, N$) are i.i.d. Gaussian vectors with pairwise correlation between $x_{n,i}$ and $x_{n,j}$ given by $\mathrm{cor}(i, j) = 0.5^{|i-j|}$. The noise is Gaussian with standard error $\sigma \in \{1, 3, 6\}$. Like Zou (2006), we consider the relative prediction error
$$\mathrm{RPE} = E[(x_{\mathrm{test}}^T\{\hat\alpha([X, Y]_{\mathrm{training}}) - \alpha\})^2]/\sigma^2,$$
where the expectation is taken over training and test sets, and the response. Note that we calculate RPE given $(X, Y)_{\mathrm{training}}$ exactly, using the knowledge of the distribution of the covariates (Zou relies on 10,000 test observations instead). This predictive measure is reported in Table 2 to compare lasso, adaptive lasso and smooth adaptive lasso using two-fold cross-validation to select their hyperparameter(s). We observe that smooth adaptive lasso improves significantly over adaptive lasso, and over lasso when the underlying model is sparse and the noise is small; the estimation is slightly worse for the non-sparse model.

We also consider selection of the hyperparameters with SURE to compare adaptive lasso and its smooth version. To compare the two estimators, we consider the relative prediction error conditional on the covariates of the training set,
$$\mathrm{RPE}_X = \sum_{n=1}^N E[(x_{\mathrm{training},n}^T\{\hat\alpha([X, Y]_{\mathrm{training}}) - \alpha\})^2]/\sigma^2,$$
where the expectation is taken over the response only. This measure, although less useful in practice unless prediction is sought at the same locations as the training set, helps quantify the improvement of using smooth James-Stein thresholding. Table 3 reports the results, which show a systematic gain of the smooth version over adaptive lasso. Lasso is often better for the non-sparse model. We also report in Table 4 the number C of selected nonzero components and the number I of zero components incorrectly selected. We observe that the selection is correct when the noise is small ($\sigma = 1$) with smooth/adaptive lasso based on SURE, but that some false detections occur when the noise grows ($\sigma = 3$).

Table 2: Zou's Monte-Carlo simulation: median RPE of lasso, adaptive lasso and smooth adaptive lasso, using two-fold cross-validation to select the hyperparameter(s), on 100 training sets, for Model 1 ($N = 20$ and $N = 60$) and Model 2 ($N = 40$ and $N = 80$) with $\sigma \in \{1, 3, 6\}$.

Table 3: Zou's Monte-Carlo simulation: median $\mathrm{RPE}_X$ at the training covariates $X$ of lasso, adaptive lasso and smooth adaptive lasso, using SURE to select the hyperparameter(s), on 100 training sets, for Model 1 ($N = 20$ and $N = 60$) and Model 2 ($N = 40$ and $N = 80$) with $\sigma \in \{1, 3, 6\}$.

Table 4: Median number of selected variables for Model 1 with $n = 60$: number C of correctly selected nonzero components and number I of incorrectly selected zero components, for $\sigma = 1$ and $\sigma = 3$, for the truth, lasso, adaptive lasso, SCAD and garrote (results taken from the Monte-Carlo simulation of Zou (2006)), and for adaptive lasso and smooth adaptive lasso with hyperparameters selected by SURE.

5.4 Smooth SURE for the prostate cancer data

We illustrate the estimation of the $\ell_2$ risk and the smoothness gained with the smooth adaptive lasso on the prostate cancer data with $P = 8$ covariates used by Tibshirani (1996). Figure 5 shows the estimated risk as a function of the two hyperparameters $\lambda$ and $\nu$, either for the adaptive lasso (left) or for the smooth adaptive lasso (right). Both estimations of the risk are unbiased, but the latter is less erratic. The advantage of smoothness would be even more pronounced with data containing a larger collection of covariates. The optimal hyperparameters $(\hat\lambda, \hat\nu) = (1.12, 10)$ select a subset of five covariates out of the eight original ones.

Figure 5: Prostate cancer data: Stein unbiased risk estimate as a function of $\lambda$ and $\nu$ for the adaptive lasso (left, with $\gamma = 1$) and the smooth adaptive lasso (right, with $\gamma = \log\nu$).

6 Further extensions

Smooth James-Stein thresholding (19) offers additional flexibility to fit the underlying sparsity thanks to the parameters $(\lambda, \nu)$, for which we proposed selection rules. Interestingly, this thresholding function relies on the $\ell_2$ norm. This measure may not be appropriate in certain situations. Indeed, if one wants to measure departure from the zero vector in the sense that all entries must be different from zero,

then $\|Y_n\|_2$ in (19) should possibly be replaced to give
$$\hat\alpha_{n,\lambda,\nu;\gamma} = \left(1 - \frac{\lambda^\nu}{\min_{q=1,\ldots,Q} |Y_{nq}|^\nu}\right)_+^\gamma Y_n, \quad \lambda \ge 0,\ \nu > 0,\ \gamma \ge 1,\ Y_n \in \mathbb{R}^Q, \qquad (36)$$
in which case the universal threshold is $\lambda_{N,Q} = \sqrt{(2/Q)\log N}$. Other quantiles could be considered, or other norms, like the $\ell_1$ norm, to penalize a large departure from zero in the entries of $Y_n$ less than the $\ell_2$ norm does.
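A small sketch of the variant (36) and of its universal threshold; names are ours.

```python
import numpy as np

def eta_min(y, lam, nu, gamma):
    """Variant (36): threshold on the smallest absolute entry of the block instead of
    its l2 norm, so a block survives only if all its entries are away from zero."""
    m = np.min(np.abs(y))
    if m <= lam:
        return np.zeros_like(y)
    return (1.0 - (lam / m) ** nu) ** gamma * y

N, Q = 1000, 3
lam_univ = np.sqrt(2.0 * np.log(N) / Q)   # universal threshold for the min-based rule
```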

For the group lasso, Yuan and Lin (2006) approximate the degrees of freedom of their estimator. One can instead follow the derivation of Theorem 2 to derive the exact degrees of freedom. Consider the smooth group adaptive lasso, a generalization of the group lasso (13), defined as
$$\hat\beta_j(Y) = s_j\, (1/w_j)^\gamma \quad\text{with}\quad w_j = \frac{\|\hat\beta_j^{\mathrm{LS}}\|_2^{\nu-1}\, \|s_j\|_2}{\eta^{\mathrm{soft}}_{\lambda^\nu}(\|\hat\beta_j^{\mathrm{LS}}\|_2^{\nu-1}\, \|s_j\|_2)}, \quad s_j = X_j^T r_j, \quad r_j = Y - X_{-j}\hat\beta(Y), \quad j = 1, \ldots, J, \qquad (37)$$
where the coefficient vector $\beta \in \mathbb{R}^P$ is segmented as $\beta = (\beta_1, \ldots, \beta_J)$ into $J$ blocks of respective sizes $p_j$ such that $\sum_{j=1}^J p_j = P$. Likewise $\hat\beta^{\mathrm{LS}} = (\hat\beta_1^{\mathrm{LS}}, \ldots, \hat\beta_J^{\mathrm{LS}})$ and $X = [X_1 \ldots X_J]$ with $X_j^T X_j = I_{p_j}$ for $j = 1, \ldots, J$. Note that (37) is the generalization of (2.4) of Yuan and Lin (2006), taking $K_j = I_{p_j}$ in their notation. Let $J_0 = \{j \in \{1, \ldots, J\}: \hat\beta_j(Y) = 0\} = \{j_i\}_{i=1,\ldots,|J_0|}$ of cardinality $|J_0|$, and let $\bar J_0$ be its complement. Deriving SURE in that setting requires calculating the nonzero elements $h_n^{\bar J_0}$ of the gradient $\nabla_n\hat\beta(Y)$ of the estimated vector defined by (37) with respect to the data $Y_n$, $n = 1, \ldots, N$. Following a derivation similar to that of Appendix B, one finds that the $h_n^{\bar J_0}$ are defined by a set of linear equations of the form (32), more precisely
$$C_j X_j^T X^{\bar J_0} D_j^{\bar J_0} h_n^{\bar J_0} = z_{nj}, \quad j = j_i, \quad i = 1, \ldots, |\bar J_0|, \qquad (38)$$
where the $z_{nj}$ are vectors of length $p_j$ and $D_j^{\bar J_0}$ is a block diagonal matrix made of identity matrices, except for the $j$th block, which is equal to $C_j^{-1}$, where $C_j = e_j s_j s_j^T + f_j I_{p_j}$ is a positive definite matrix since $e_j = \gamma (1/w_j)^{\gamma-1} \lambda^\nu / \|\hat\beta_j^{\mathrm{LS}}\|_2^{\nu-1} / \|s_j\|_2^3 > 0$ and $f_j = (1/w_j)^\gamma > 0$. Based on (38), SURE can be calculated and minimized over all values of $\gamma \ge 1$ and $\nu \ge 1$ to provide the equivalent degrees of freedom of the (smooth adaptive) group lasso. Unless blocks are of unit size, in which case there is no more grouping of covariates, there does not seem to be a closed form expression for the degrees of freedom of the group lasso when $\nu = 1$ and $\gamma \ge 1$.


The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Spatial Lasso with Applications to GIS Model Selection

Spatial Lasso with Applications to GIS Model Selection Spatial Lasso with Applications to GIS Model Selection Hsin-Cheng Huang Institute of Statistical Science, Academia Sinica Nan-Jung Hsu National Tsing-Hua University David Theobald Colorado State University

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Signal Denoising with Wavelets

Signal Denoising with Wavelets Signal Denoising with Wavelets Selin Aviyente Department of Electrical and Computer Engineering Michigan State University March 30, 2010 Introduction Assume an additive noise model: x[n] = f [n] + w[n]

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

A significance test for the lasso

A significance test for the lasso 1 Gold medal address, SSC 2013 Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Reaping the benefits of LARS: A special thanks to Brad Efron,

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

Relaxed Lasso. Nicolai Meinshausen December 14, 2006

Relaxed Lasso. Nicolai Meinshausen December 14, 2006 Relaxed Lasso Nicolai Meinshausen nicolai@stat.berkeley.edu December 14, 2006 Abstract The Lasso is an attractive regularisation method for high dimensional regression. It combines variable selection with

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010 Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to

More information

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University

More information

Machine learning, shrinkage estimation, and economic theory

Machine learning, shrinkage estimation, and economic theory Machine learning, shrinkage estimation, and economic theory Maximilian Kasy December 14, 2018 1 / 43 Introduction Recent years saw a boom of machine learning methods. Impressive advances in domains such

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Comments on \Wavelets in Statistics: A Review" by. A. Antoniadis. Jianqing Fan. University of North Carolina, Chapel Hill

Comments on \Wavelets in Statistics: A Review by. A. Antoniadis. Jianqing Fan. University of North Carolina, Chapel Hill Comments on \Wavelets in Statistics: A Review" by A. Antoniadis Jianqing Fan University of North Carolina, Chapel Hill and University of California, Los Angeles I would like to congratulate Professor Antoniadis

More information

Lecture 5: Soft-Thresholding and Lasso

Lecture 5: Soft-Thresholding and Lasso High Dimensional Data and Statistical Learning Lecture 5: Soft-Thresholding and Lasso Weixing Song Department of Statistics Kansas State University Weixing Song STAT 905 October 23, 2014 1/54 Outline Penalized

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

A robust hybrid of lasso and ridge regression

A robust hybrid of lasso and ridge regression A robust hybrid of lasso and ridge regression Art B. Owen Stanford University October 2006 Abstract Ridge regression and the lasso are regularized versions of least squares regression using L 2 and L 1

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Quantile universal threshold for model selection

Quantile universal threshold for model selection Quantile universal threshold for model selection Caroline Giacobino Sylvain Sardy Jairo Diaz Rodriguez Nick Hengartner In various settings such as regression, wavelet denoising and low rank matrix estimation,

More information

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania Professors Antoniadis and Fan are

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Pathwise coordinate optimization

Pathwise coordinate optimization Stanford University 1 Pathwise coordinate optimization Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani Stanford University Acknowledgements: Thanks to Stephen Boyd, Michael Saunders,

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach

Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1999 Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach T. Tony Cai University of Pennsylvania

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Sparsity in Underdetermined Systems

Sparsity in Underdetermined Systems Sparsity in Underdetermined Systems Department of Statistics Stanford University August 19, 2005 Classical Linear Regression Problem X n y p n 1 > Given predictors and response, y Xβ ε = + ε N( 0, σ 2

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Howard D. Bondell and Brian J. Reich Department of Statistics, North Carolina State University,

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function

Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-03-10 Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function

More information

arxiv: v2 [math.st] 9 Feb 2017

arxiv: v2 [math.st] 9 Feb 2017 Submitted to the Annals of Statistics PREDICTION ERROR AFTER MODEL SEARCH By Xiaoying Tian Harris, Department of Statistics, Stanford University arxiv:1610.06107v math.st 9 Feb 017 Estimation of the prediction

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Statistics for high-dimensional data: Group Lasso and additive models

Statistics for high-dimensional data: Group Lasso and additive models Statistics for high-dimensional data: Group Lasso and additive models Peter Bühlmann and Sara van de Geer Seminar für Statistik, ETH Zürich May 2012 The Group Lasso (Yuan & Lin, 2006) high-dimensional

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information