Variable selection in high-dimensional regression problems


1 Variable selection in high-dimensional regression problems
Jan Mielniczuk, Polish Academy of Sciences and Warsaw University of Technology
Based on joint research with P. Pokarowski, A. Prochenka, P. Teisseyre and M. Kubkowski

2 Outline
Introduction: penalized empirical risk minimisation
Variable selection for linear and logistic regression
Variable selection for a misspecified logistic model: some comments

3–5 Penalized Empirical Risk Minimization (PERM)
Data form $(y, x^T)$: $y$ is the response (quantitative or nominal), $x = (x_1, \ldots, x_p)^T \in \mathbb{R}^p$ is the vector of predictors.
Penalized risk minimization framework: Data $= \{(y_1, x_1^T), \ldots, (y_n, x_n^T)\} = \mathrm{Train} \cup \mathrm{Valid} \cup \mathrm{Test}$; $\beta$ is the model parameter, $\lambda$ the penalty parameter.
Fitting: $\hat\beta(\lambda) = \arg\min_{\beta}\{\mathrm{err}(\beta, \mathrm{Train}) + \mathrm{penalty}(\beta, \lambda)\}$
Selection: $\hat\lambda = \arg\min_{\lambda} \mathrm{err}(\hat\beta(\lambda), \mathrm{Valid})$
Assessment: $\widehat{\mathrm{err}} = \mathrm{err}(\hat\beta(\hat\lambda), \mathrm{Test})$
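A minimal sketch of this fit/select/assess loop in Python, assuming a lasso penalty and scikit-learn (both are illustrative choices of mine, not prescribed by the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = [3.0, 1.5, 2.0]
y = X @ beta + rng.standard_normal(n)

# Data = Train | Valid | Test
X_tr, y_tr = X[:100], y[:100]
X_va, y_va = X[100:200], y[100:200]
X_te, y_te = X[200:], y[200:]

fits = []
for lam in np.logspace(-3, 0, 30):
    m = Lasso(alpha=lam).fit(X_tr, y_tr)                     # Fitting: beta_hat(lambda)
    fits.append((np.mean((y_va - m.predict(X_va)) ** 2), lam, m))

valid_err, lam_hat, best = min(fits, key=lambda t: t[0])     # Selection: lambda_hat
test_err = np.mean((y_te - best.predict(X_te)) ** 2)         # Assessment on Test
print(lam_hat, test_err)
```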

6–7 Penalized Empirical Risk Minimization
The empirical risk err is a generalization of the prediction error and of the negative log-likelihood: $\mathrm{err}(\beta, \mathrm{Train}) = \sum_{i=1}^{n} L(y_i, f(x_i, \beta))$, which is (usually) a convex function of $\beta$; $L(y, f)$ is a loss function.
For $\beta = (\beta_1, \ldots, \beta_p)^T$ the penalty is coordinatewise: $\mathrm{penalty}(\beta, \lambda) = \sum_{j=1}^{p} P_\lambda(|\beta_j|)$, with typical choices of $P_\lambda(t)$ lying between $\lambda\,\mathbf{1}(t > 0)$ and $\lambda t^2$.

8 Figure: classification loss functions (binomial, hinge, square/4, 0–1 prediction error) plotted against the margin $y \cdot f$, and regression loss functions (square, $\varepsilon$-insensitive, Huber) plotted against $y - f$.

9–11 Classical Penalty Functions
Ridge regression, $\ell_2$ penalty (Hoerl and Kennard (1970)): $P_\lambda(t) = \lambda t^2$
Generalized Information Criterion (GIC; includes AIC, BIC), $\ell_0$ penalty (Nishii (1984)): $P_\lambda(t) = 2\lambda\,\mathbf{1}(t > 0)$
Lasso, $\ell_1$ penalty (Chen and Donoho (1995), Tibshirani (1996)): $P_\lambda(t) = \lambda t$
Important for high-dimensional problems: sparsity of the Lasso solution is induced by $P'_\lambda(0+) > 0$.
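For concreteness, the three classical penalties written as Python functions (a trivial illustration; the function names are mine):

```python
import numpy as np

def ridge_penalty(t, lam):
    """Ridge (l2) penalty: P_lambda(t) = lam * t^2."""
    return lam * t ** 2

def l0_penalty(t, lam):
    """GIC-type (l0) penalty: P_lambda(t) = 2 * lam * 1(t > 0)."""
    return 2.0 * lam * (np.abs(t) > 0).astype(float)

def lasso_penalty(t, lam):
    """Lasso (l1) penalty: P_lambda(t) = lam * |t|."""
    return lam * np.abs(t)

t = np.linspace(-2, 2, 5)
print(ridge_penalty(t, 1.0), l0_penalty(t, 1.0), lasso_penalty(t, 1.0))
```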

12 Validation criteria
Choice of penalty: $\hat\lambda = \arg\min_{\lambda} \mathrm{err}(\hat\beta(\lambda), \mathrm{Valid})$
$\mathrm{err}(\hat\beta) = \hat{E}\,\|\hat\beta - \beta\|^2$ (estimation error)
$\mathrm{err}(\hat\beta) = \hat{E}\,\|X(\hat\beta - \beta)\|^2$ (prediction error)
$\mathrm{err}(\hat\beta) = P(y\,x^T\hat\beta < 0)$ (classification error)
$\mathrm{err}(\hat\beta) = P(\mathrm{supp}\,\hat\beta \neq \mathrm{supp}\,\beta)$ (selection error)
others: FDR control etc.

13 Selection consistency
Selection consistency: $P(\hat T \neq T)$ is negligible for large $n$, or equivalently the Type I and Type II error probabilities are negligible for large $n$.
Explanatory value; a fundamental property for the correctness of post-model-selection inference.

14 Linear predictive models
Why is linear regression so important?
The linear predictive model $\hat Y = g(x^T\hat\beta)$ is the cornerstone of prediction; examples: neural nets, compressed sensing, generalized linear models (GLM), ARMA models, etc.
The linear-model solution to the two-class classification problem works well. It is not a fluke!

15 Linear model
$y = (y_1, \ldots, y_n)^T$, $X = [x_{1\cdot}, \ldots, x_{n\cdot}]^T = [x_{\cdot 1}, \ldots, x_{\cdot p}]$.
$y^T 1_n = 0$ and the columns are standardized: $x_{\cdot j}^T 1_n = 0$, $x_{\cdot j}^T x_{\cdot j} = 1$ for $j = 1, \ldots, p$.
Linear regression model: $y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, $i = 1, \ldots, n$, where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T \in \mathbb{R}^n$ are iid zero-mean errors.

16–17 High dimensionality and sparsity
Aim: operational algorithms of risk minimisation which work in the high-dimensional setting.
Two features of the problem:
High dimensionality: $p > n$ or $p \gg n$; NP-dimensionality: $p \leq \exp(n^\alpha)$ for some $\alpha > 0$;
Sparsity: the active set $T = \{i : \beta_i \neq 0\}$ satisfies ("bet on sparsity") $|T| \ll \min(n, p)$.

18 Bet on sparsity

19 Bet on sparsity (statistical insight)
Consider $\hat\beta^{OLS}_T$, the OLS fit on the true active set $T$, as an oracle benchmark. Then $n^{-1} E\,\|X\hat\beta^{OLS}_T - X\beta\|^2 = \sigma^2 |T| / n$, which is useless when $|T| \sim n$.
Simple approaches such as OLS on all $p > n$ predictors do not work (perfect fit on the training data).
Penalized approaches are valuable because they can yield sparsity of the solution.
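A quick Monte Carlo check of the oracle-risk identity above (a sketch with simulation settings of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 20, 1.0
T = [0, 1, 2]                                   # true active set, |T| = 3
beta = np.zeros(p); beta[T] = [3.0, 1.5, 2.0]

risks = []
for _ in range(2000):
    X = rng.standard_normal((n, p))
    y = X[:, T] @ beta[T] + sigma * rng.standard_normal(n)
    beta_T = np.linalg.lstsq(X[:, T], y, rcond=None)[0]        # oracle OLS on T
    risks.append(np.sum((X[:, T] @ (beta_T - beta[T])) ** 2) / n)

print(np.mean(risks), sigma ** 2 * len(T) / n)  # both close to 0.03
```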

20 LASSO estimator in the linear model
Least Absolute Shrinkage and Selection Operator:
$\hat\beta_L = \hat\beta_L(\lambda) = \mathrm{argmin}_{\gamma}\,\{\|y - X\gamma\|^2 + 2\lambda\|\gamma\|_1\}$
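In scikit-learn (one possible implementation, not prescribed by the slides) the Lasso minimizes $\frac{1}{2n}\|y - X\gamma\|^2 + \alpha\|\gamma\|_1$, so the $\lambda$ in the objective above corresponds to $\alpha = \lambda/n$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 200                                        # p > n
X = rng.standard_normal((n, p))
X -= X.mean(axis=0); X /= np.linalg.norm(X, axis=0)    # centered, unit-norm columns
beta = np.zeros(p); beta[:5] = 4.0
sigma = 0.5
y = X @ beta + sigma * rng.standard_normal(n)
y -= y.mean()

lam = sigma * np.sqrt(2 * np.log(p))                   # a common theory-driven choice
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
T_hat_L = np.flatnonzero(fit.coef_)                    # T_hat_L = {i : beta_hat_L,i != 0}
print(T_hat_L)
```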

21 Figure: penalty functions, Lasso versus Ridge, shown as constraint regions in the $(\beta_1, \beta_2)$ plane.

22 Figure: inclusion of predictors by the Lasso for the prostate data; coefficient paths plotted against $\log\lambda$.

23–25 Lasso acts as a selector: $\hat T_L = \{i : \hat\beta_{L,i} \neq 0\}$; however:
Strong regularity conditions on the moment matrix $X^TX$ are needed to ensure selection consistency of the Lasso.
The Lasso shrinks and selects at the same time. It tries to compensate for the shrunken coefficient of a true predictor by incorporating a spurious predictor correlated with the true one.
The Lasso is a screening rather than a selection operator (Bühlmann (2011)). It owes its popularity to ease of computation (LARS for linear regression, coordinate descent for other frameworks).

26–27 Post-Lasso World
Folded Concave Penalties (FCP): $P_\lambda(t)$ is increasing and concave, $P_\lambda(0) = 0$, $P'_\lambda(0+) > 0$, and $P_\lambda(t)$ is constant for $t > \gamma\lambda$ for some $\gamma > 1$.
Much more difficult algorithmically, but approximate solutions such as LLA exist. Examples: SCAD, MCP, capped $\ell_1$.
Figure: comparison of penalty functions (GIC/$\ell_0$, MCP, Lasso, ridge).
MCP approximates the $\ell_0$ penalty more closely than the Lasso does.

28 Figure: Minimax Concave Penalty (MCP); the penalty $P_\lambda(t)$ and the corresponding thresholding functions for $\gamma = 25$, $2.5$ and $1.1$.
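A sketch of the MCP penalty and its scalar (orthonormal-design) thresholding rule, written out by me from the standard MCP formulas rather than taken from the slides:

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """MCP: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam, and gamma*lam^2/2 beyond."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)

def mcp_threshold(z, lam, gamma):
    """Scalar MCP thresholding (gamma > 1): firm shrinkage up to gamma*lam, identity beyond."""
    a = np.abs(z)
    firm = np.sign(z) * np.maximum(a - lam, 0.0) / (1.0 - 1.0 / gamma)
    return np.where(a > gamma * lam, z, firm)

z = np.linspace(-4, 4, 9)
print(mcp_threshold(z, lam=1.0, gamma=2.5))
```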

29–31 Three properties of the Lasso which can be used (at the price of conditions!)
Selection consistency ($T = \{i : \beta_i \neq 0\}$): $\hat T_L = T$, i.e. $\min_{i \in T}|\hat\beta_{L,i}| > \max_{i \notin T}|\hat\beta_{L,i}| = 0$. Never satisfied under realistic assumptions.
Separation: the Lasso separates $T$ from $T^c$: $\min_{i \in T}|\hat\beta_{L,i}| > \max_{i \notin T}|\hat\beta_{L,i}|$. Frequently fails (Su et al. (2015)).
Screening: the Lasso yields screening: $\hat T_L \supseteq T$, i.e. $\min_{i \in T}|\hat\beta_{L,i}| > 0$. Holds under much milder conditions (Zou, 2006).

32 Our generic approach: the SOS algorithm (P. Pokarowski, JM, JMLR (2015))
Use the Lasso as a screening method;
Order the obtained variables;
Use the Generalized Information Criterion.

33 Screening-Selection (SS) procedure
A version of SOS (JMLR (2015)) with the O (ordering) step removed.
Algorithm 1 SS
Input: $y$, $X$ and $\lambda$
Screening (Lasso): $\hat\beta = \hat\beta(\lambda) = \mathrm{argmin}_{\gamma}\,\{\|y - X\gamma\|^2 + 2\lambda\|\gamma\|_1\}$;
order the nonzero coefficients: $|\hat\beta_{j_1}| \geq |\hat\beta_{j_2}| \geq \ldots \geq |\hat\beta_{j_s}|$, where $s = |\mathrm{supp}\,\hat\beta|$;
set $\mathcal{J} = \{\{j_1\}, \{j_1, j_2\}, \ldots, \{j_1, \ldots, j_s\}\}$;
Selection (GIC): $\hat T = \mathrm{argmin}_{J \in \mathcal{J}}\,\{\mathrm{SSE}_J + \lambda^2 |J|\}$
Output: $\hat\beta_{SS} = (X_{\hat T}^T X_{\hat T})^{-1} X_{\hat T}^T y$
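A plain numpy/scikit-learn sketch of the SS procedure as stated above (my own illustrative implementation, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import Lasso

def ss_procedure(X, y, lam):
    """Screening-Selection: Lasso screen, order by |coefficient|, GIC over nested sets."""
    n, p = X.shape
    # Screening: minimize ||y - X g||^2 + 2*lam*||g||_1  (sklearn scaling: alpha = lam/n)
    coef = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_
    nonzero = np.flatnonzero(coef)
    order = nonzero[np.argsort(-np.abs(coef[nonzero]))]          # j_1, ..., j_s
    # Selection: GIC(J) = SSE_J + lam^2 * |J| over the nested family
    best_gic, best_J = np.sum(y ** 2), np.array([], dtype=int)   # empty model as fallback
    for k in range(1, len(order) + 1):
        J = order[:k]
        b_J = np.linalg.lstsq(X[:, J], y, rcond=None)[0]
        gic = np.sum((y - X[:, J] @ b_J) ** 2) + lam ** 2 * k
        if gic < best_gic:
            best_gic, best_J = gic, J
    # Output: OLS refit on the selected set T_hat
    beta_ss = np.zeros(p)
    if len(best_J):
        beta_ss[best_J] = np.linalg.lstsq(X[:, best_J], y, rcond=None)[0]
    return best_J, beta_ss
```

Called with, e.g., $\lambda = \sigma\sqrt{2\log p}$ (the choice used in the experiments below), it returns the selected set $\hat T$ and the refitted coefficients.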

34 Limitations on selection consistency (statistical insight)
To detect the active set, the dependence between the active set and its complement must not be too strong; equivalently, $X^TX = \frac{\partial^2}{\partial\beta\,\partial\beta^T}\big(\|y - X\beta\|^2/2\big)$ must not be too degenerate.
What does this mean for $p \gg n$?

35 Figure: strict convexity of the risk over a certain cone (Tibshirani et al. (2015)).

36–37 Feasibility parameters
Sign-restricted identifiability factor (SCIF): $\zeta_{T,a} = \inf_{\nu \in C_{T,a}} \frac{\|X^TX\nu\|}{\|\nu\|}$, where $C_{T,a}$ for $a \in (0, 1)$ is a certain cone. Restriction to $C_{T,a}$ ensures $\zeta_{T,a} > 0$ for many high-dimensional designs.
Scaled K-L distance: the scaled Kullback-Leibler distance between $T$ and its submodels is $\delta_T = \min_{J \subsetneq T} \frac{\|(I - H_J)X_T\beta_T\|^2}{|T \setminus J|}$.

38 Bound for $P(\hat T_{SS} \neq T)$ (PP & JM, 2015)
Theorem. Under mild assumptions on the feasibility parameters we have
$P(\hat T_{SS} \neq T) \leq p\,\exp\!\big(-\frac{\lambda^2}{2\sigma^2}\big)$ (some constants are omitted).
For the true regressors to be distinguishable from the noise, $\beta_{\min} = \min_{i \in T}|\beta_i|$ has to be sufficiently large; hence a condition of the form $\zeta^2_{T,a}\,\beta^2_{\min} \geq C > 0$.

39 Algorithm 2 SSnet (Screening-Selection algorithm on a net of $\lambda$'s)
Input: $y$, $X$ and $(\lambda_0, \lambda_1, \ldots, \lambda_m)^T$
Screening (Lasso): for $k = 1$ to $m$ do
$\hat\beta^{(k)} = \hat\beta(\lambda_k) = \mathrm{argmin}_{\gamma}\,\{\|y - X\gamma\|^2 + 2\lambda_k\|\gamma\|_1\}$;
order the nonzero coefficients: $|\hat\beta^{(k)}_{j_1}| \geq |\hat\beta^{(k)}_{j_2}| \geq \ldots \geq |\hat\beta^{(k)}_{j_{s_k}}|$, where $s_k = |\mathrm{supp}\,\hat\beta^{(k)}|$;
set $\mathcal{J}_k(y) = \{\{j_1\}, \{j_1, j_2\}, \ldots, \mathrm{supp}\,\hat\beta^{(k)}\}$;
end for
Selection (GIC): $\mathcal{J}(y) = \bigcup_{k=1}^{m} \mathcal{J}_k(y)$; $\hat T = \mathrm{argmin}_{J \in \mathcal{J}(y)}\,\{R_J + \lambda_0^2 |J|\}$
Output: $\hat T$, $\hat\beta_{SSnet} = (X_{\hat T}^T X_{\hat T})^{-1} X_{\hat T}^T y$

40 SOSnet algorithm
Use the Lasso with $\lambda_i$, $i = 0, 1, \ldots, m$, to choose sets of predictors $I_i$;
Fit the linear models $y \sim x_{I_i}$, $i = 0, 1, \ldots, m$;
Order the predictors according to their squared t-statistics;
Construct $\mathcal{M}$ = nested models;
Use GIC on $\mathcal{M}$ to choose the final model (see the sketch below).
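A rough sketch of SOSnet along these lines (illustrative only; the $\lambda$-grid and the GIC constant are assumed to be supplied by the user, and the implementation details are mine, not the authors'):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sosnet(X, y, lambdas, lam0):
    """SOSnet sketch: per-lambda Lasso screens, t-statistic ordering, GIC selection."""
    n, p = X.shape
    candidates = set()                                   # union of nested families
    for lam in lambdas:
        I = np.flatnonzero(Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_)
        if len(I) == 0 or len(I) >= n:
            continue
        XI = X[:, I]
        b = np.linalg.lstsq(XI, y, rcond=None)[0]        # OLS refit y ~ x_I
        resid = y - XI @ b
        sigma2 = resid @ resid / (n - len(I))
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(XI.T @ XI)))
        order = I[np.argsort(-(b / se) ** 2)]            # order by squared t-statistic
        candidates.update(tuple(sorted(order[:k])) for k in range(1, len(order) + 1))
    best_gic, best_J = np.sum(y ** 2), ()                # empty model as fallback
    for J in candidates:                                 # GIC over the collected models
        XJ = X[:, list(J)]
        bJ = np.linalg.lstsq(XJ, y, rcond=None)[0]
        gic = np.sum((y - XJ @ bJ) ** 2) + lam0 ** 2 * len(J)
        if gic < best_gic:
            best_gic, best_J = gic, J
    return np.array(best_J, dtype=int)
```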

41 Numerical experiments
Four groups of algorithms:
SS, SSnet, SOSnet;
MCP calibrated by GIC (sparsenet);
MCP calibrated by CV (sparsenet, two settings);
MCP ($a = 1.5$ and $a = 3$) (PLUS).
$\lambda = \sigma\sqrt{2\log(p)}$; penalization term for GIC: $c\lambda^2$ with three values of $c$ taken from $\{1, 1.5, 2, 2.5, 3, 3.5, 4\}$.

42 Experiments cont'd
M I: $\beta_1 = (3, 1.5, 0, 0, 2, 0_{p-5}^T)^T$ from Wang et al. (2013), $p = 3000$;
M II: $\beta_2 = (0_{p-10}^T, \pm 2, \ldots, \pm 2)^T$ from Wang et al. (2014), $p = 2000$; the signs $\pm$ are chosen separately for every run.
$x_1, \ldots, x_p$: normal with autoregressive (exp. a: $\rho = 0.5$, b: $\rho = 0.7$) or equicorrelated (exp. c: $\rho = 0.5$, d: $\rho = 0.7$) correlation structure.
$n = 100$ (M I) and $n = 200$ (M II).

43 Table: True Model selection (TM) rate (%) for experiments 1a-1d and 2a-2d; rows: SS, SSnet and SOSnet (each with three GIC constants $c$), sparsenet calibrated by GIC (three values of $c$), sparsenet calibrated by CV (p.1se and p.min), and two MCP (PLUS) variants. (The numerical entries were not preserved in the transcription.)

44 Table: Relative Mean Squared Error (MSE) for the same experiments and methods as in the previous table. (The numerical entries were not preserved in the transcription.)

45 Comments on results
SOSnet: higher correct-selection probability and lower MSE simultaneously in almost all experimental setups. The difference is most pronounced for higher correlations.
Computing times for SOSnet are more than 2 times shorter than for sparsenet + GIC, more than 4 times shorter than for sparsenet + CV, and more than 20 times shorter than for the PLUS implementation.
Sparsenet tuned by GIC works much better than sparsenet tuned by CV.

46–49 Binary case: remarks
Conceptually the same; the loss function changes, usually to the logistic loss. More difficult algorithmically.
Theoretical analysis is more difficult due to the heteroscedasticity of the response.
NP-dimensional case: filtering based on a ranking of univariate fits (e.g. SIS, Fan et al. (2009)) and then PERM analysis on the chosen subset.
Fitting univariate (e.g. logistic) models to multivariate logistic data is an ultimate type of model misspecification.

50 Misspecified logistic model
A different angle: using the logistic loss in empirical risk minimisation amounts to fitting a logistic model.
The data come from the binary model $P(Y = 1 \mid X) = q(\beta_0 + \beta^T X)$, where $X$ is a random variable in $\mathbb{R}^p$ and $q$ is the response function.
The logistic response $p$, $p(\beta_0 + \beta^T x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}$, is the most frequently used tool to model the dependence of a binary outcome on attributes.

51 Important special cases: omission of (some) valid predictors from the logistic model itself, filters in particular.
What happens when we misspecify the response function and use the logistic response $p$ instead of $q$? Some bias in the estimation of $\beta$ surely occurs, but how important is the error?
It is obvious that we cannot learn $\beta$ when $q$ is arbitrary, but what about the direction of $\beta$? Can we learn $\mathrm{supp}\,\beta$?

52 Yes, we can (frequently, at least)

53 Simpler framework: minimization of empirical risk ($p < n$), i.e. ML:
$(\hat\beta_0, \hat\beta_{ML}) = \arg\min_{\gamma_0, \gamma}\,\mathrm{err}(\gamma_0, \gamma)$.
Using $(\hat\beta_0^{ML}, \hat\beta^{ML})$ we estimate not $\beta_0$ and $\beta$ but $\beta_0^*$ and $\beta^*$ such that
$(\beta_0^*, \beta^*) = \mathrm{argmin}_{b_0, b \in \mathbb{R}^p}\,E_X\,\mathrm{KL}\big(q(\beta_0 + X^T\beta),\,p(b_0 + X^Tb)\big)$,
where $\mathrm{KL}(q, p) = q\log\!\left(\frac{q}{p}\right) + (1 - q)\log\!\left(\frac{1 - q}{1 - p}\right)$
is the Kullback-Leibler distance between two Bernoulli distributions with success probabilities $q$ and $p$.
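To make the projection $(\beta_0^*, \beta^*)$ concrete: since the $q$-entropy term in the KL distance does not depend on $(b_0, b)$, the projection can be approximated by minimizing the empirical logistic risk on a large simulated sample. A sketch (the probit choice of $q$ and all settings are mine, purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n, p = 100_000, 3
beta0, beta = 0.5, np.array([1.0, -2.0, 0.0])

X = rng.standard_normal((n, p))
q = norm.cdf                                   # a non-logistic true response (probit)
y = rng.binomial(1, q(beta0 + X @ beta))

def logistic_risk(par):
    """Empirical logistic risk; its minimizer approximates the KL projection."""
    b0, b = par[0], par[1:]
    s = b0 + X @ b
    return np.mean(np.maximum(s, 0) + np.log1p(np.exp(-np.abs(s))) - y * s)

opt = minimize(logistic_risk, np.zeros(p + 1), method="BFGS")
print(opt.x)   # slope part is roughly proportional to beta (here eta * beta, eta > 0)
```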

54 What can go wrong: $X_2 = (X_1 + \varepsilon)^2$, $P(Y = 1 \mid x) = q(x_1)$.
Figure: squares and triangles correspond to $Y = 1$ and $Y = 0$; the solid line shows the direction of $\hat\beta$.

55 Positive result: Ruud's theorem (1983)
Assume that the distribution of $X$ is nondegenerate and such that the regressions with respect to $\beta^TX$ are linear:
$E(X \mid \beta^TX) = u\,\beta^TX + u_0$. (R)
Then there exists $\eta$ such that $\beta^* = \eta\beta$.
Important: is $\eta \neq 0$?

56–57 Relevance for selection of predictors ($p < n$)
Order the variables according to their residual deviances $\mathrm{Dev}_{f\setminus\{i_1\}} \geq \mathrm{Dev}_{f\setminus\{i_2\}} \geq \ldots \geq \mathrm{Dev}_{f\setminus\{i_p\}}$ and minimize GIC in the corresponding nested family.
Then, if (R) is satisfied and $q$ is strictly monotone, $\hat T_{GIC}$ is consistent (P. Teisseyre, JM (2015)).
When $\eta > 1$ we are frequently better off misspecifying the model than fitting the correct one.
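A sketch of this deviance-based ordering with GIC selection, using scikit-learn's logistic regression as the fitting engine (an illustrative reimplementation of mine, not the authors' code; the BIC-type constant in the usage line is also my choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def deviance(X, y):
    """Residual deviance of a (practically unpenalized) logistic fit of y on X."""
    m = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)       # large C ~ no penalty
    p_hat = np.clip(m.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    return -2 * np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def gic_logistic_select(X, y, penalty):
    """Order variables by the deviance of the model with that variable removed,
    then minimize GIC = deviance + penalty * |J| over the nested family."""
    n, p = X.shape
    dev_without = np.array([deviance(np.delete(X, i, axis=1), y) for i in range(p)])
    order = np.argsort(-dev_without)                  # most important variables first
    p0 = np.clip(y.mean(), 1e-12, 1 - 1e-12)          # null model as fallback
    best = (-2 * np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0)), [])
    for k in range(1, p + 1):
        J = list(order[:k])
        gic = deviance(X[:, J], y) + penalty * k
        if gic < best[0]:
            best = (gic, J)
    return best[1]

# usage with a BIC-type penalty (illustrative):
# T_hat = gic_logistic_select(X, y, penalty=np.log(len(y)))
```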

58 Figure: correct selection (CS) probability versus $\hat\eta$ for models M2, M3, M4, ...

59 Figure: PSR versus $\hat\eta$ for the same models.

60 Figure: FDR versus $\hat\eta$ for the same models.

61–62 For normal predictors we have (M. Kubkowski, JM 2016)
$\eta = \frac{E\,q'(\beta_0 + \beta^TX)}{E\,p'(\beta_0^* + \beta^{*T}X)} = \frac{E\,q'(\beta_0 + \beta^TX)}{E\,p'(\beta_0^* + \eta\beta^TX)}$
If $(Y, X)$ follows the logistic model and $\beta^*_{lin}$ is the projection onto a linear model, then $\beta^*_{lin} = E\,p'(\beta_0 + \beta^TX)\,\beta$, i.e. the direction $\beta/\|\beta\|$ of $\beta$ can be recovered by fitting a linear model.

63 Some relevant papers
P. Pokarowski and J. Mielniczuk, Combined $\ell_1$ and Greedy $\ell_0$ Penalized Least Squares for Linear Model Selection, Journal of Machine Learning Research, 2015.
J. Mielniczuk and P. Teisseyre, What do we choose when we err? Model selection and testing for misspecified logistic model, Studies in Computational Intelligence, 2015.
P. Ruud, Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models, Econometrica, 1983.
W. Su, M. Bogdan and E. Candès, False discoveries occur early on the Lasso path, preprint.
T. Hastie, R. Tibshirani and M. Wainwright, Statistical Learning with Sparsity, CRC Press, 2015.

64 Machine Learning or Statistics?

65 A few papers from JMLR
P. Bellec and A. Tsybakov, Sharp Oracle Bounds for Monotone and Convex Regression Through Aggregation, JMLR 2015.
J. Jin, C.-H. Zhang and Q. Zhang, Optimality of Graphlet Screening in High Dimensional Variable Selection, JMLR 2014.
X. Li, T. Zhao, Yuan and H. Liu, The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R, JMLR 2015.
P. Pokarowski and J. Mielniczuk, Combined $\ell_1$ and Greedy $\ell_0$ Penalized Least Squares for Linear Model Selection, JMLR 2015.
M. Tan, I. W. Tsang and L. Wang, Towards Ultrahigh Dimensional Feature Selection for Big Data, JMLR 2014.

66 A few papers from the Annals of Statistics
P. Sherwood and L. Wang, Partially linear additive quantile regression in ultra-high dimension, AS 2016.
R. Barber and E. Candès, Controlling the false discovery rate via knockoffs, AS 2015.
Y. Yang and S. Tokdar, Minimax-optimal nonparametric regression in high dimensions, AS 2015.
B. Jiang and J. S. Liu, Variable selection for general index models via sliced inverse regression, AS 2014.
J. Fan, L. Xue and H. Zou, Strong oracle optimality of folded concave penalized estimation, AS 2014.

67 Most cited statistical papers (Pokarowski, 2015); total citations in the Web of Science:
Kaplan-Meier ('58, JASA): 39.4; Cox model ('72, JRSS B): 29.2; FDR ('95, JRSS B): 18.7; EM ('77, JRSS B): 17.2; AIC ('74, IEEE Trans. Autom. Contr.): 15.7; BIC ('78, Ann. Stat.): 12.0; Random Forests ('01, Mach. Learn.): 7.6; SVM ('95, Mach. Learn.): 6.8; Lasso ('96, JRSS B): 5.x (value truncated in the transcription).

68 Computational considerations
The Lasso regularized-path solution requires $O(np\min(n, p))$ flops using LARS;
the selection step requires $O(ns^2)$ flops, $s = |\mathrm{supp}\,\hat\beta_L|$: use a QR decomposition. This follows since $\mathcal{J}$ is nested!
The screening step is the most expensive one in this and the other algorithms (to be presented).
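A sketch of why nestedness helps: with a single QR decomposition of the ordered design, the residual sums of squares of all nested models fall out at marginal extra cost (illustrative numpy code of mine):

```python
import numpy as np

def nested_sse(X_ordered, y):
    """SSE of all nested models {1}, {1,2}, ..., {1,...,s} from one QR decomposition.

    Since the columns enter in order, span(x_1..x_k) = span(q_1..q_k), so
    SSE_k = ||y||^2 - sum_{j<=k} (q_j^T y)^2.
    """
    Q, _ = np.linalg.qr(X_ordered, mode="reduced")
    contrib = (Q.T @ y) ** 2
    return np.sum(y ** 2) - np.cumsum(contrib)

# usage sketch: GIC over the nested family built from Lasso-ordered columns
# sse = nested_sse(X[:, order], y); gic = sse + lam0 ** 2 * np.arange(1, len(order) + 1)
```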
