High-dimensional regression with unknown variance


1 High-dimensional regression with unknown variance. Christophe Giraud, École Polytechnique, March 2012

2 Setting

Gaussian regression with unknown variance:
$$Y_i = f_i + \varepsilon_i, \qquad \varepsilon_i \ \text{i.i.d.}\ \mathcal{N}(0, \sigma^2),$$
where $f = (f_1, \dots, f_n)$ and $\sigma^2$ are unknown; we want to estimate $f$.

Ex 1: sparse linear regression. $f = X\beta$ with $\beta$ sparse in some sense and $X \in \mathbb{R}^{n \times p}$, possibly with $p > n$.

Ex 2: non-parametric regression. $f_i = F(x_i)$ with $F : \mathcal{X} \to \mathbb{R}$.
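A minimal simulation of this setting; the sizes $n, p, k$ and the noise level are hypothetical choices, not values from the talk:

```python
# Sketch: Y = X beta + eps, eps_i i.i.d. N(0, sigma^2), beta k-sparse, p > n.
import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 500, 5, 1.0          # p > n: high-dimensional regime
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0                             # k-sparse coefficient vector
f = X @ beta                               # unknown mean vector f
Y = f + sigma * rng.standard_normal(n)     # sigma is unknown to the statistician
```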

3 A plethora of estimators

Sparse linear regression:
- coordinate sparsity: Lasso, Dantzig, Elastic-Net, Exponential-Weighting, projection on subspaces $\{V_\lambda : \lambda \in \Lambda\}$ given by PCA, Random Forest, etc.
- structured sparsity: Group-Lasso, Fused-Lasso, Bayesian estimators, etc.

Non-parametric regression: spline smoothing, Nadaraya kernel smoothing, kernel ridge estimators, nearest neighbors, $L_2$-basis projection, Sparse Additive Models, etc.

4 Important practical issues

Which estimator should be used?
- Sparse regression: Lasso? Random Forest? Exponential-Weighting?
- Non-parametric regression: kernel regression (which kernel?)? Spline smoothing?

Which tuning parameter? Which penalty level for the Lasso? Which bandwidth for kernel regression? Etc.

5 The objective

Difficulties:
- no procedure is universally better than the others;
- a sensible choice of the tuning parameters depends on some unknown characteristics of $f$ (sparsity, smoothness, etc.) and on the unknown variance $\sigma^2$.

Ideal objective: select the best estimator among a collection $\{\hat f_\lambda,\ \lambda \in \Lambda\}$. (Alternative objective: combine the estimators as well as possible.)

6 Impact of not knowing the variance

7 Impact of the unknown variance? Case of coordinate-sparse linear regression.

[Figure: minimax prediction risk over $k$-sparse signals as a function of $k$, comparing the regime where $\sigma$ or $k$ is known with the regime where both $\sigma^2$ and $k$ are unknown; the risk level $2k\log(p/k)$ and the sample size $n$ are marked, and an "ultra-high dimension" region appears in the unknown-variance case.]

8 Ultra-high dimensional phenomenon

Theorem (N. Verzelen, EJS 2012). When $\sigma^2$ is unknown, there exist designs $X$ of size $n \times p$ such that for any estimator $\hat\beta$, we have either
$$\sup_{\sigma^2 > 0} \mathbb{E}\big[\|X(\hat\beta - 0_p)\|^2\big] > C_1\, n\sigma^2,$$
or
$$\sup_{\beta_0\ k\text{-sparse},\ \sigma^2 > 0} \mathbb{E}\big[\|X(\hat\beta - \beta_0)\|^2\big] > C_2\, k \log\!\Big(\frac{p}{k}\Big) \exp\!\Big[C_3 \frac{k}{n} \log\Big(\frac{p}{k}\Big)\Big]\, \sigma^2.$$

Consequence: when $\sigma^2$ is unknown, the best we can expect is a bound of the form
$$\mathbb{E}\big[\|X(\hat\beta - \beta_0)\|^2\big] \le C \inf_\beta \big\{ \|X(\beta - \beta_0)\|^2 + \|\beta\|_0 \log(p)\, \sigma^2 \big\}$$
for any $\sigma^2 > 0$ and any $\beta_0$ fulfilling $1 \le \|\beta_0\|_0 \le C\, n/\log(p)$.

9 Some generic selection schemes

10 Some generic selection schemes

- Cross-validation: hold-out, V-fold CV, leave-q-out.
- Penalized empirical loss: penalized log-likelihood (AIC, BIC, etc.), plug-in criteria (with Mallows' $C_p$, etc.), slope heuristic.
- Approximation versus complexity penalization: LinSelect.

A minimal sketch of the V-fold scheme is given below.
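A hedged sketch of V-fold CV for tuning a penalty level (illustrative only; scikit-learn's LassoCV packages the same loop):

```python
# Minimal V-fold cross-validation over a grid of Lasso penalty levels.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def vfold_cv_lasso(X, Y, alphas, V=10, seed=0):
    errs = np.zeros(len(alphas))
    for train, test in KFold(V, shuffle=True, random_state=seed).split(X):
        for j, a in enumerate(alphas):
            fit = Lasso(alpha=a).fit(X[train], Y[train])
            errs[j] += np.mean((Y[test] - fit.predict(X[test])) ** 2)
    return alphas[int(np.argmin(errs))]    # penalty with the smallest CV error
```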

11 LinSelect (Y. Baraud, C. G. & S. Huet)

Ingredients:
- a collection $\mathcal{S}$ of linear spaces (for approximation),
- a weight function $\Delta : \mathcal{S} \to \mathbb{R}^+$ (a measure of complexity).

Criterion (residuals + approximation + complexity):
$$\mathrm{Crit}(\hat f_\lambda) = \inf_{S \in \hat{\mathcal{S}}} \Big[ \|Y - \Pi_S \hat f_\lambda\|^2 + \|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + \mathrm{pen}(S)\, \hat\sigma_S^2 \Big],$$
where $\hat{\mathcal{S}} \subset \mathcal{S}$ is possibly data-dependent, $\Pi_S$ is the orthogonal projector onto $S$, $\mathrm{pen}(S) \asymp \dim(S) \vee \Delta(S)$ when $\dim(S) \vee \Delta(S) \le 2n/3$, and
$$\hat\sigma_S^2 = \frac{\|Y - \Pi_S Y\|^2}{n - \dim(S)}.$$
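A schematic numpy sketch of this criterion (not the authors' code; `pen` is a hypothetical penalty function standing in for $\mathrm{pen}(S)$):

```python
# Evaluate the LinSelect criterion for one estimator f_hat over candidate spaces.
import numpy as np

def linselect_crit(Y, f_hat, bases, pen):
    # bases: list of (n x d) matrices with orthonormal columns spanning each S, d < n
    n, best = len(Y), np.inf
    for B in bases:
        d = B.shape[1]
        P = lambda v: B @ (B.T @ v)                      # projector Pi_S
        resid = np.sum((Y - P(f_hat)) ** 2)              # residual term
        approx = np.sum((f_hat - P(f_hat)) ** 2)         # approximation term
        sigma2_S = np.sum((Y - P(Y)) ** 2) / (n - d)     # variance estimate on S
        best = min(best, resid + approx + pen(d) * sigma2_S)
    return best
```

Selection then amounts to computing this value for each $\hat f_\lambda$ and keeping the minimizer.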

12 Non-asymptotic risk bound

Assumptions:
1. $1 \le \dim(S) \vee \Delta(S) \le 2n/3$ for all $S \in \mathcal{S}$,
2. $\sum_{S \in \mathcal{S}} e^{-\Delta(S)} \le 1$.

Theorem (Y. Baraud, C.G., S. Huet).
$$\mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le C\, \mathbb{E}\Big[ \inf_{\lambda \in \Lambda} \Big\{ \|f - \hat f_\lambda\|^2 + \inf_{S \in \hat{\mathcal{S}}} \big\{ \|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + [\dim(S) \vee \Delta(S)]\,\sigma^2 \big\} \Big\} \Big].$$
The bound also holds in deviation.

13 Sparse linear regression

14 Instantiation of LinSelect

Estimators: linear regressors $\{\hat f_\lambda = X\hat\beta_\lambda : \lambda \in \Lambda\}$ (e.g. Lasso, Exponential-Weighting, etc.)

Approximation and complexity:
$$\mathcal{S} = \big\{ \mathrm{range}(X_J) : J \subset \{1, \dots, p\},\ 1 \le |J| \le n/(3\log p) \big\},$$
$$\Delta(S) = \log \binom{p}{\dim(S)} + \log(\dim(S)) \approx \dim(S)\log(p).$$

Subcollection $\hat{\mathcal{S}}$: we set $\hat S_\lambda = \mathrm{range}\big(X_{\mathrm{supp}(\hat\beta_\lambda)}\big)$ and define $\hat{\mathcal{S}} = \{\hat S_\lambda,\ \lambda \in \hat\Lambda\}$, where $\hat\Lambda = \{\lambda \in \Lambda : \hat S_\lambda \in \mathcal{S}\}$.
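A hypothetical sketch of this instantiation: build the data-dependent subcollection from the supports of a family of Lasso solutions (function names are illustrative):

```python
# Orthonormal bases for S_hat_lambda = range(X_J), J = supp(beta_lambda),
# keeping only supports small enough to be admissible in the collection S.
import numpy as np
from math import comb, log

def weight(d, p):
    # Delta(S) = log C(p, d) + log d, of order d*log(p)
    return log(comb(p, d)) + log(d)

def spaces_from_supports(X, beta_list):
    n, p = X.shape
    bases = []
    for b in beta_list:                       # one Lasso solution per lambda
        J = np.flatnonzero(b)
        if 1 <= len(J) <= n / (3 * log(p)):   # admissibility: |J| <= n/(3 log p)
            Q, _ = np.linalg.qr(X[:, J])      # orthonormal basis of range(X_J)
            bases.append(Q)
    return bases
```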

15 Case of the Lasso estimators

Lasso estimators:
$$\hat\beta_\lambda = \mathop{\mathrm{argmin}}_\beta \big\{ \|Y - X\beta\|^2 + 2\lambda \|\beta\|_1 \big\}, \qquad \lambda > 0.$$

Parameter tuning, theory: for $X$ with columns normalized to 1, $\lambda \approx \sigma\sqrt{2\log(p)}$.

Parameter tuning, practice: V-fold CV, BIC criterion.
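A hedged illustration of the two tuning routes, continuing the simulation sketch from the Setting slide (it reuses `X`, `Y`, `sigma`, `n`, `p` defined there). Scikit-learn minimizes $\|Y - X\beta\|^2/(2n) + \alpha\|\beta\|_1$, so for a design with standardized columns the theoretical level translates to roughly $\alpha \approx \sigma\sqrt{2\log(p)/n}$; the exact scale depends on the column normalization:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Theory: requires knowing sigma, the quantity that is unknown in this talk.
alpha_theory = sigma * np.sqrt(2 * np.log(p) / n)
beta_theory = Lasso(alpha=alpha_theory).fit(X, Y).coef_

# Practice: 10-fold cross-validation over an automatic grid of penalties.
model_cv = LassoCV(cv=10).fit(X, Y)
print(model_cv.alpha_, np.count_nonzero(model_cv.coef_))
```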

16 Recent criteria that are pivotal with respect to the variance

$\ell_1$-penalized log-likelihood (Städler, Bühlmann, van de Geer):
$$\big(\hat\beta^{LL}_\lambda, \hat\sigma^{LL}_\lambda\big) := \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p,\, \sigma > 0} \Big[ n\log(\sigma) + \frac{\|Y - X\beta\|_2^2}{2\sigma^2} + \lambda \frac{\|\beta\|_1}{\sigma} \Big].$$

$\ell_1$-penalized Huber loss (Belloni et al., Antoniadis):
$$\big(\hat\beta^{SR}_\lambda, \hat\sigma^{SR}_\lambda\big) := \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p,\, \sigma > 0} \Big[ \frac{n\sigma}{2} + \frac{\|Y - X\beta\|_2^2}{2\sigma} + \lambda \|\beta\|_1 \Big].$$

Equivalent to the Square-Root Lasso (introduced before):
$$\hat\beta^{SR}_\lambda = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \Big[ \|Y - X\beta\|_2 + \frac{\lambda}{\sqrt n} \|\beta\|_1 \Big].$$

Sun & Zhang: optimization with a single LARS call.
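A sketch of the square-root / scaled Lasso via alternating minimization, in the spirit of Sun & Zhang's scaled Lasso but using scikit-learn's coordinate descent rather than their single LARS call (parameter names are illustrative):

```python
# Alternate a beta-step (Lasso at penalty lam0*sigma) and a sigma-step
# (RMS of the residuals), which jointly minimizes the Huber-type criterion.
import numpy as np
from sklearn.linear_model import Lasso

def sqrt_lasso(X, Y, lam0=None, n_iter=20):
    n, p = X.shape
    if lam0 is None:
        lam0 = np.sqrt(2 * np.log(p) / n)    # pivotal level: no sigma needed
    sigma = np.std(Y)                        # crude initialization
    for _ in range(n_iter):
        # For fixed sigma, the beta-step is a Lasso whose penalty is
        # proportional to sigma (sklearn's alpha parametrization).
        beta = Lasso(alpha=lam0 * sigma).fit(X, Y).coef_
        sigma_new = np.linalg.norm(Y - X @ beta) / np.sqrt(n)
        if abs(sigma_new - sigma) < 1e-8 * sigma:
            break
        sigma = sigma_new
    return beta, sigma                       # joint estimate of (beta, sigma)
```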

17 The compatibility constant

$$\kappa[\xi, T] = \min_{u \in \mathcal{C}(\xi, T)} \big\{ |T|^{1/2} \|Xu\|_2 / \|u_T\|_1 \big\}, \quad \text{where}\ \ \mathcal{C}(\xi, T) = \{u : \|u_{T^c}\|_1 < \xi \|u_T\|_1\}.$$

Restricted eigenvalue: for $k_* = n/(3\log(p))$ we set $\phi_* = \sup\big\{ \|Xu\|_2 / \|u\|_2 : u\ k_*\text{-sparse} \big\}$.

Theorem for the Square-Root Lasso (Sun & Zhang). For $\lambda = 2\sqrt{2\log(p)}$, if we assume that
$$\|\beta_0\|_0 \le C_1\, \kappa^2[4, \mathrm{supp}(\beta_0)]\, \frac{n}{\log(p)},$$
then, with high probability,
$$\|X(\hat\beta - \beta_0)\|_2^2 \le \inf_\beta \Big\{ \|X(\beta_0 - \beta)\|_2^2 + C_2\, \frac{\|\beta\|_0 \log(p)}{\kappa^2[4, \mathrm{supp}(\beta)]}\, \sigma^2 \Big\}.$$

18 The compatibility constant (continued)

(Compatibility constant $\kappa[\xi, T]$ and restricted eigenvalue $\phi_*$ as on the previous slide.)

Theorem for LinSelect Lasso. If we assume that
$$\|\beta_0\|_0 \le C_1\, \kappa^2[4, \mathrm{supp}(\beta_0)]\, \frac{n}{\phi_* \log(p)},$$
then, with high probability,
$$\|X(\hat\beta - \beta_0)\|_2^2 \le C \inf_\beta \Big\{ \|X(\beta_0 - \beta)\|_2^2 + C_2\, \frac{\|\beta\|_0\, \phi_* \log(p)}{\kappa^2[4, \mathrm{supp}(\beta)]}\, \sigma^2 \Big\}.$$

19 Numerical experiments (1/2): tuning the Lasso

165 examples extracted from the literature; each example $e$ is evaluated on the basis of 400 runs; comparison to the oracle $\hat\beta_{\lambda^*}$.

[Table: for each procedure $\ell$, the 0%, 50%, 75%, 90% and 95% quantiles of $R[\hat\beta_{\hat\lambda_\ell}; \beta_0] / R[\hat\beta_{\lambda^*}; \beta_0]$ over the examples $e = 1, \dots, 165$, for Lasso 10-fold CV, Lasso LinSelect and the Square-Root Lasso; the numerical entries did not survive transcription.]

20 Numerical experiments (2/2): computation time

    n     p     10-fold CV   LinSelect   Square-Root
    ...   ...   ... s        0.21 s      0.18 s
    ...   ...   ... s        0.43 s      0.4 s
    ...   ...   ... s        11 s        6.3 s

(The values of n, p and the 10-fold CV timings were lost in transcription.)

Packages: enet for 10-fold CV and LinSelect; lars for the Square-Root Lasso (procedure of Sun & Zhang).

21 Non-parametric regression

22 An important class of estimators

Linear estimators: $\hat f_\lambda = A_\lambda Y$ with $A_\lambda \in \mathbb{R}^{n \times n}$:
- spline smoothing or kernel ridge estimators with smoothing parameter $\lambda \in \mathbb{R}^+$,
- Nadaraya estimators $A_\lambda$ with smoothing parameter $\lambda \in \mathbb{R}^+$,
- $\lambda$-nearest neighbors, $\lambda \in \{1, \dots, k\}$,
- $L_2$-basis projection (on the first $\lambda$ elements), etc.

Selection criteria (with $\sigma^2$ unknown): cross-validation schemes (including GCV), Mallows' $C_L$ + plug-in / slope heuristic, LinSelect.

Two such smoother matrices are sketched below.
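A hedged sketch of two smoother matrices $A_\lambda$ for a one-dimensional design $x$ (the Gaussian kernel, bandwidth and ridge level are illustrative choices):

```python
# Two classical linear estimators f_hat = A @ Y built from a design vector x.
import numpy as np

def kernel_ridge_smoother(x, lam, bw=0.5):
    # A_lambda = K (K + lam*I)^{-1} for a Gaussian kernel Gram matrix K.
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))
    return K @ np.linalg.inv(K + lam * np.eye(len(x)))

def nadaraya_smoother(x, bw):
    # Row-normalized kernel weights: an averaging matrix.
    W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))
    return W / W.sum(axis=1, keepdims=True)
```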


24 Slope heuristic (Arlot & Bach)

Procedure, for $\hat f_\lambda = A_\lambda Y$:
1. compute $\hat\lambda_0(\sigma') = \mathrm{argmin}_\lambda \big\{ \|Y - \hat f_\lambda\|^2 + \sigma'^2\, \mathrm{Tr}(2A_\lambda - A_\lambda^\top A_\lambda) \big\}$;
2. select $\hat\sigma$ such that $\mathrm{Tr}\big(A_{\hat\lambda_0(\hat\sigma)}\big) \in [n/10, n/3]$;
3. select $\hat\lambda = \mathrm{argmin}_\lambda \big\{ \|Y - \hat f_\lambda\|^2 + 2\hat\sigma^2\, \mathrm{Tr}(A_\lambda) \big\}$.

Main assumptions:
- $A_\lambda$ is a shrinkage or averaging matrix (covers all the classics);
- bias assumption: there exists $\lambda_1$ with $\mathrm{Tr}(A_{\lambda_1}) \le \sqrt n$ and $\|(I - A_{\lambda_1})f\|^2 \le \sigma^2 \sqrt{n \log(n)}$.

Theorem (Arlot & Bach). With high probability:
$$\|\hat f_{\hat\lambda} - f\|^2 \le (1 + \varepsilon) \inf_\lambda \|\hat f_\lambda - f\|^2 + C\,\varepsilon^{-1} \log(n)\, \sigma^2.$$

A schematic implementation is sketched below.
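A schematic numpy sketch of this three-step procedure (illustrative, not the authors' code; the grid of trial variances is a hypothetical choice):

```python
# Slope-heuristic selection among linear estimators f_lambda = A_lambda @ Y.
import numpy as np

def slope_heuristic(Y, A_list):
    n = len(Y)
    tr = np.array([np.trace(A) for A in A_list])                  # Tr(A_lambda)
    pen0 = np.array([np.trace(2 * A - A.T @ A) for A in A_list])  # minimal penalty
    resid = np.array([np.sum((Y - A @ Y) ** 2) for A in A_list])

    def lam0(s2):                      # step 1: minimizer for trial variance s2
        return np.argmin(resid + s2 * pen0)

    # step 2: pick sigma^2 so that the selected Tr(A) falls in [n/10, n/3]
    sigma2 = np.sum((Y - A_list[-1] @ Y) ** 2) / n     # fallback guess
    for s2 in np.geomspace(1e-4, 1e4, 200):
        if n / 10 <= tr[lam0(s2)] <= n / 3:
            sigma2 = s2
            break

    # step 3: final selection with the optimal penalty 2*sigma^2*Tr(A_lambda)
    return np.argmin(resid + 2 * sigma2 * tr)
```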

25 LinSelect

Approximation spaces: $\hat{\mathcal{S}} = \bigcup_\lambda \big\{ S^1_\lambda, \dots, S^{n/2}_\lambda \big\}$, where $S^k_\lambda$ is spanned by the $k$ last right-singular vectors of
$$A_\lambda^+ \Pi_\lambda : \mathrm{range}(A_\lambda) \to \mathrm{range}(A_\lambda),$$
where $A_\lambda^+$ is the inverse of the restriction of $A_\lambda$ to $\mathrm{range}(A_\lambda)$ and $\Pi_\lambda$ is induced by the projection onto $\mathrm{range}(A_\lambda)$.

Weight: $\Delta(S) = \beta(1 + \dim(S))$ with $\beta > 0$ such that $\sum_S e^{-\Delta(S)} \le 1$.

Corollary. When $\sigma_{n/2}(A_\lambda^+ \Pi_\lambda) \ge 1/2$ for all $\lambda \in \Lambda$, we have
$$\mathbb{E}\big[\|f - \hat f\|^2\big] \le C \inf_{\lambda \in \Lambda} \mathbb{E}\big[\|\hat f_\lambda - f\|^2\big].$$

26 LinSelect (continued)

Same approximation spaces and weight as on the previous slide.

Remark: when $A_\lambda$ is symmetric positive definite, $S^k_\lambda$ is spanned by the $k$ first eigenvectors of $A_\lambda$.


28 LinSelect (continued)

Corollary (in deviation). When $\sigma_{n/2}(A_\lambda^+ \Pi_\lambda) \ge 1/2$ for all $\lambda \in \Lambda$, we have with high probability
$$\|f - \hat f\|^2 \le C \inf_{\lambda \in \Lambda} \|\hat f_\lambda - f\|^2 + \log(n)\, \sigma^2.$$
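Per the remark above, for symmetric positive definite $A_\lambda$ the spaces $S^k_\lambda$ can be read off an eigendecomposition; a minimal sketch:

```python
# Nested approximation spaces S^1, ..., S^max_dim from a symmetric positive
# definite smoother A (leading eigenvectors first).
import numpy as np

def approximation_spaces(A, max_dim):
    vals, vecs = np.linalg.eigh(A)      # eigenvalues in ascending order
    vecs = vecs[:, ::-1]                # reorder: leading eigenvectors first
    return [vecs[:, :k] for k in range(1, max_dim + 1)]
```

These bases can be fed directly to a criterion such as the `linselect_crit` sketch given earlier.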

29 A review

High-dimensional regression with unknown variance, C.G., S. Huet & N. Verzelen, arXiv preprint (covering coordinate-sparsity, group-sparsity, variation-sparsity and multivariate regression).
