Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space

1 Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space. Jinchi Lv, Data Sciences and Operations Department, Marshall School of Business, University of Southern California. In collaboration with Yingying Fan.

2 Outline: Motivation; Regularization methods in thresholded parameter space; Asymptotic equivalence of regularization methods; Implementation; Numerical studies

3 Big Data Problems Commonly encountered in diverse fields ranging from genomics and health sciences to economics, finance, and machine learning. DNA microarrays are used to produce expression measurements of tens of thousands of genes (about 20,000 genes in the human genome), or to identify hundreds of thousands of single nucleotide polymorphisms (SNPs) (500,000 to 1,000,000 SNPs in genome-wide association studies). Typical sample size: tens to hundreds for gene expression data, and hundreds to thousands for SNP data. MRI and fMRI data. High-dimensional data: p >> n or p comparable to n. Big data: both n and p large.

4 Continued A heat map displaying microarray data (rows for subjects and columns for genes), with colors ranging from bright green (negative, under-expressed) to bright red (positive, over-expressed)

5 Challenges and Impacts of High Dimensionality Effectively identifying important variables and efficiently estimating their effects are very challenging when p >> n. Classical methods: best subset selection with AIC and BIC (L_0-regularization) has NP-hard complexity. High collinearity, spurious correlation, and noise accumulation (Fan and Lv, 2008 & 2010). Concentration phenomenon of high dimension, low sample size data (Hall, Marron and Neeman, 2005). Impacts of dimensionality: probabilistic and non-probabilistic views (Lv, 2013).

6 High-Dimensional Sparse Modeling A common strategy for estimating covariate effects is fitting a parametric model of p dimensions. Issues of overfitting and model identifiability. High-dimensional sparse modeling using regularization methods: minimize the empirical risk function + Σ_{j=1}^p p_λ(|β_j|). Convex penalties: Lasso (Tibshirani, 1996), Adaptive Lasso (Zou, 2006), Group Lasso (Yuan and Lin, 2006), ... Concave penalties: SCAD (Fan, 1997; Fan and Li, 2001), MCP (Zhang, 2010), SICA (Lv and Fan, 2009), ...

7 Continued Related L_1-regularization method: the Dantzig selector (Candes and Tao, 2007). Approximately equivalent to the Lasso (Bickel, Ritov and Tsybakov, 2009). Exactly equivalent to the Lasso (Meinshausen, Rocha and Yu, 2007; James, Radchenko and Lv, 2009). Asymptotic equivalence of hard-thresholding and L_0-regularization (Zheng, Fan and Lv, 2014). Understanding the connections and differences of different regularization methods is important both theoretically and empirically.

8 Penalty Functions [Figure: plots of the penalty function p_λ(θ) against θ for the SCAD, MCP, hard-thresholding, and soft-thresholding (L_1) penalties, shown in two panels.]
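
For concreteness, here is a minimal Python sketch of the penalty families plotted above; the SCAD, MCP, and SICA formulas follow their standard definitions (Fan and Li, 2001; Zhang, 2010; Lv and Fan, 2009), and the default values of the shape parameter a are illustrative choices rather than those used in the talk.

```python
import numpy as np

def soft_penalty(t, lam):
    # L1 (soft-thresholding / Lasso) penalty: p_lambda(t) = lambda * t
    return lam * t

def scad_penalty(t, lam, a=3.7):
    # SCAD penalty of Fan and Li (2001), defined piecewise in t >= 0
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2),
    )

def mcp_penalty(t, lam, a=3.0):
    # MCP penalty of Zhang (2010)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)

def sica_penalty(t, lam, a=0.1):
    # SICA penalty of Lv and Fan (2009): lambda * (a + 1) * t / (a + t)
    return lam * (a + 1) * t / (a + t)

t = np.linspace(0, 3, 7)
print(scad_penalty(t, lam=1.0), mcp_penalty(t, lam=1.0), sica_penalty(t, lam=1.0), sep="\n")
```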

9 A Motivating Example Linear model: 100 simulations with (s, n, p) = (7, 60, 1000) and β_0 = (1, -0.5, 0.7, -1.2, -0.9, 0.3, 0.55, 0^T)^T; rows of X sampled as i.i.d. copies from N(0, I_p) and, independently, ε ~ N(0, 0.3² I_n). Probability of successfully recovering supp(β_0): Lasso 1%, SCAD 77%.

10 Continued [Figure: an illustrative solution path for the Lasso (left panel) and SCAD (right panel).]

11 Continued Will both methods share any similarity when viewed in an appropriate way? Success probability of recovering supp(β_0) in the thresholded parameter space: Lasso_t 92%, SCAD_t 95%. [Figure: an illustrative solution path for Lasso_t and SCAD_t.]

12 Questions of Interest What are the connections and differences of all regularization methods? Will they be essentially the same under certain measures? Is there any interesting phase transition phenomenon?

13 Regularization Methods in Thresholded Parameter Space

14 Model Setup (x_i, y_i), i = 1, ..., n: n independent observations from (x, y) in the generalized linear model (GLM) linking a p-dimensional predictor vector x to a scalar response Y. With the canonical link, the conditional distribution of Y given x has density f(y; θ, φ) = exp{yθ - b(θ) + c(y, φ)}, where θ = x^T β with β a p-dimensional regression coefficient vector, b(·) and c(·,·) are known functions, and φ is the dispersion parameter. β_0 = (β_{0,1}, ..., β_{0,p})^T is sparse with many zero components. We allow log p = O(n^a) for some 0 < a < 1.

15 Regularization Methods Log-likelihood function l_n(β) = Σ_{i=1}^n { y_i x_i^T β - b(x_i^T β) + c(y_i, φ) }. Penalized negative log-likelihood Q_n(β) = -n^{-1}{ y^T Xβ - 1^T b(Xβ) } + ‖p_λ(β)‖_1, where y = (y_1, ..., y_n)^T, X = (x_1, ..., x_n)^T, b(θ) = (b(θ_1), ..., b(θ_n))^T with θ = (θ_1, ..., θ_n)^T, and ‖p_λ(β)‖_1 = Σ_{j=1}^p p_λ(|β_j|). Each column of X is rescaled to have L_2-norm n^{1/2}. Investigated in, e.g., Fan and Li (2001), Fan and Peng (2004), van de Geer (2008), Fan and Lv (2011), ...
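
The penalized objective above can be written down directly; the following is a minimal sketch for the Bernoulli (logistic) GLM, where b(θ) = log(1 + e^θ), with the L_1 penalty plugged in. The function and variable names are illustrative, not from the talk.

```python
import numpy as np

def b_logistic(theta):
    # cumulant function of the Bernoulli GLM: b(theta) = log(1 + exp(theta))
    return np.logaddexp(0.0, theta)

def penalized_neg_loglik(beta, X, y, lam, penalty):
    # Q_n(beta) = -n^{-1} { y^T X beta - 1^T b(X beta) } + sum_j p_lambda(|beta_j|)
    n = X.shape[0]
    theta = X @ beta
    return -(y @ theta - b_logistic(theta).sum()) / n + penalty(np.abs(beta), lam).sum()

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)      # columns rescaled to L2-norm n^{1/2}
beta0 = np.zeros(p)
beta0[:3] = [1.0, -0.8, 0.6]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))
l1 = lambda t, lam: lam * t                         # L1 penalty p_lambda(t) = lambda * t
print(penalized_neg_loglik(beta0, X, y, lam=0.1, penalty=l1))
```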

16 Robust Spark The robust spark κ_c of the n × p design matrix X is defined as the smallest possible positive integer such that there exists an n × κ_c submatrix of n^{-1/2} X having a singular value less than a given positive constant c (Zheng, Fan and Lv, 2014). Bounding the sparse model size to control collinearity and ensure model identifiability and stability. This concept generalizes the spark introduced in Donoho and Elad (2003). κ_c ≤ n + 1. As c → 0+, κ_c approaches the spark.
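
The robust spark has no efficient general algorithm; the brute-force sketch below simply scans submatrices of n^{-1/2} X in increasing size and reports the first size whose smallest singular value drops below c. It is only feasible for very small p and is meant to make the definition concrete.

```python
import numpy as np
from itertools import combinations

def robust_spark(X, c):
    # Smallest k such that some n x k submatrix of n^{-1/2} X has a singular value < c.
    # Exponential cost in p, so this is an illustration only.
    n, p = X.shape
    Xs = X / np.sqrt(n)
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            smin = np.linalg.svd(Xs[:, cols], compute_uv=False).min()
            if smin < c:
                return k
    return p + 1  # no submatrix is nearly singular at level c

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 6))
print(robust_spark(X, c=0.5))
```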

17 Continued Robust spark can be some large number diverging with n. The order of κ_c when X is generated from a Gaussian distribution: Proposition 1. Assume log p = o(n) and that the rows of the n × p random design matrix X are independent and identically distributed (i.i.d.) as N(0, Σ), where Σ has smallest eigenvalue bounded from below by some positive constant. Then there exist positive constants c and c̃ such that, with asymptotic probability one, κ_c ≥ c̃ n/(log p).

18 Thresholded Parameter Space We introduce a thresholded parameter space B_{τ,c} = {β ∈ R^p : ‖β‖_0 < κ_c/2 and, for each j, β_j = 0 or |β_j| ≥ τ}, where β = (β_1, ..., β_p)^T and τ is some positive threshold on the parameter magnitude. τ is key to distinguishing between important covariates and noise covariates for the purpose of variable selection. τ typically needs to satisfy τ √(n/(log p)) → ∞ as n → ∞. Motivated by best subset regression with L_0-regularization and oracle risk inequalities under the prediction loss (Barron, Birge and Massart, 1999).

19 Artificial or Natural? Hard-thresholding property: Proposition 2. For the L_0-penalty p_λ(t) = λ 1_{t ≠ 0}, the global minimizer β̂ = (β̂_1, ..., β̂_p)^T of the regularization problem over R^p satisfies that each component β̂_j is either 0 or has magnitude larger than some positive threshold. This property is shared by many other penalties such as the hard-thresholding and SICA penalties. If some covariates have weak effects, we can keep them out of the model to improve prediction accuracy with reduced estimation variability. Weak signals generally have difficulty standing out from noise variables due to the impact of high dimensionality.

20 Asymptotic Equivalence of Regularization Methods

21 Two Key Events Consider a universal λ = c_0 √((log p)/n), with c_0 > 0 and p understood implicitly as n ∨ p. Two key events: E = { ‖n^{-1} X^T ε‖_∞ ≤ λ/2 } and E_0 = { ‖n^{-1} X_{α_0}^T ε‖_∞ ≤ c_0 √((log n)/n) }, with X_α a submatrix of X consisting of the columns in α, α_0 = supp(β_0), and ε = (ε_1, ..., ε_n)^T = y - E(y).
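
A quick Monte Carlo check makes the first event concrete for Gaussian noise: with λ = c_0 √((log p)/n), the bound ‖n^{-1} X^T ε‖_∞ ≤ λ/2 holds in nearly every replication. The constants and design below are illustrative assumptions, not the talk's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, c0, sigma = 100, 1000, 2.0, 0.4
lam = c0 * np.sqrt(np.log(p) / n)
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)   # columns rescaled to L2-norm n^{1/2}

hits, reps = 0, 200
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    hits += int(np.max(np.abs(X.T @ eps) / n) <= lam / 2)   # event E
print(f"P(E) estimated over {reps} replications: {hits / reps:.2f}")
```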

22 Technical Conditions Condition 1 (Error tail distribution): P(E^c) = O(p^{-c_1}) and P(E_0^c) = O(n^{-c_1}) for some positive constant c_1 that can be sufficiently large for large enough c_0. Condition 2 (Bounded variance): b(θ) satisfies c_2 ≤ b''(θ) ≤ c_2^{-1} in its domain, where c_2 is some positive constant. Condition 3 (Concave penalty function): p_λ(t) is increasing and concave in t ∈ [0, ∞) with p_λ(0) = 0, and is differentiable with p'_λ(0+) = c_3 λ for some positive constant c_3 (Lv and Fan, 2009; Fan and Lv, 2011). A wide class of penalties, including the L_1-penalty in Lasso, SCAD, MCP, and SICA, satisfy Condition 3.

23 Continued Condition 4 (Ultra-high dimensionality): log p = O(n^a) for some constant a ∈ (0, 1). Condition 5 (True parameter vector): s = o(n^{1-a}) and there exists a constant c > 0 such that the robust spark κ_c > 2s. Moreover, min_{1≤j≤s} |β_{0,j}| ≫ √((log p)/n).

24 Oracle Inequalities of Global Minimizer Any global minimizer β̂ = argmin_{β ∈ B_{τ,c}} Q_n(β). Oracle inequalities were established in Bickel, Ritov and Tsybakov (2009) to study the asymptotic equivalence of the Lasso and the Dantzig selector. Total number of falsely discovered signs FS(β̂) = #{ j : sgn(β̂_j) ≠ sgn(β_{0,j}), 1 ≤ j ≤ p }, where β̂ = (β̂_1, ..., β̂_p)^T.
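
Counting falsely discovered signs is a one-liner; a minimal sketch (names illustrative):

```python
import numpy as np

def false_signs(beta_hat, beta0):
    # FS(beta_hat) = #{ j : sgn(beta_hat_j) != sgn(beta_{0,j}) }
    return int(np.sum(np.sign(beta_hat) != np.sign(beta0)))

print(false_signs(np.array([0.9, 0.0, -0.2, 0.4]), np.array([1.0, 0.0, 0.3, 0.0])))  # -> 2
```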

25 Continued Theorem 1 (Oracle inequalities). Assume that Conditions 1-5 hold and τ is chosen such that τ < min_{1≤j≤s} |β_{0,j}| and λ = c_0 √((log p)/n) = o(τ). Then the global minimizer β̂ exists, and any such global minimizer satisfies that, with probability at least 1 - O(p^{-c_1}), it holds simultaneously that: (a) (False signs). FS(β̂) ≤ C s (λ²/τ²) / (1 - C λ²/τ²); (b) (Estimation losses). ‖β̂ - β_0‖_q ≤ C λ s^{1/q} (1 - C λ²/τ²)^{-1/q} for each q ∈ [1, 2] and ‖β̂ - β_0‖_∞ ≤ C λ s^{1/2} (1 - C λ²/τ²)^{-1/2}; (c) (Prediction loss). n^{-1/2} ‖X(β̂ - β_0)‖_2 ≤ C λ s^{1/2} (1 - C λ²/τ²)^{-1/2}, where C is some positive constant.

26 Continued Results hold uniformly over the set of all possible global minimizers. The constant c_1 in the probability bound can be chosen arbitrarily large, affecting only C. FS(β̂) = o(s) since λ = o(τ), while ‖β̂_Lasso‖_0 = O(φ_max s) with φ_max the largest eigenvalue of n^{-1} X^T X (Bickel, Ritov and Tsybakov, 2009). For each q ∈ [1, 2], ‖β̂ - β_0‖_q = O{ s^{1/q} √((log p)/n) } and n^{-1/2} ‖X(β̂ - β_0)‖_2 = O(√(s (log p)/n)); these convergence rates are consistent with those in Bickel, Ritov and Tsybakov (2009) for the Lasso.

27 Continued Theorem 2 (Sign consistency and oracle inequalities). Assume that the conditions of Theorem 1 hold with min_{1≤j≤s} |β_{0,j}| ≥ 2τ, λ = c_0 √((log p)/n) = o(s^{-1/2} τ), and γ_n = o{ τ √(n/(s log n)) }. Then any global minimizer β̂ satisfies that, with probability at least 1 - O(n^{-c_1}), it holds simultaneously that: (a) (Sign consistency). sgn(β̂) = sgn(β_0); (b) (Estimation and prediction losses). If the penalty function further satisfies p'_λ(τ) = O{√((log n)/n)}, then we have, for each q ∈ [1, 2], ‖β̂ - β_0‖_q ≤ C s^{1/q} √((log n)/n), ‖β̂ - β_0‖_∞ ≤ C γ_n √((log n)/n), and n^{-1} D(β̂) ≤ C s (log n)/n, where γ_n is a constant characterizing the behavior of {n^{-1} X_{α_0}^T H(β) X_{α_0}}^{-1} (with H(β) the diagonal GLM weight matrix) in a small neighborhood of β_0, D(β̂) is the Kullback-Leibler divergence, and C is some positive constant.

28 Continued In the linear model, γ_n = ‖(n^{-1} X_{α_0}^T X_{α_0})^{-1}‖ and γ̃_n = sup_{α ⊂ {s+1, ..., p}, |α| ≤ s} ‖n^{-1} X_{α_0}^T X_α‖; moreover, γ_n ≤ s^{1/2} ‖(n^{-1} X_{α_0}^T X_{α_0})^{-1}‖_2 ≤ c^{-1} s^{1/2}. When all true covariates are orthogonal to each other, γ_n = 1 and ‖β̂ - β_0‖_∞ ≤ C √((log n)/n), within a logarithmic factor log n of the oracle rate. The condition p'_λ(τ) = O{√((log n)/n)} can be easily satisfied by concave penalties such as SCAD and SICA, so their convergence rates are improved, with log n in place of log p.

29 Phase Transition Phenomenon Combining Theorems 1 and 2 shows that for p = O(n^a), the Lasso and concave regularization methods are asymptotically equivalent, having the same convergence rates in the oracle inequalities, with a logarithmic factor of log n. For log p = O(n^a), concave regularization methods are asymptotically equivalent and still enjoy the same convergence rates in the oracle inequalities, with a logarithmic factor of log n. For the Lasso, the condition p'_λ(τ) = O{√((log n)/n)} and the choice of λ = c_0 √((log p)/n) are incompatible with each other in this case.

30 Continued In the ultra-high dimensional case, the convergence rates for the Lasso, which involve a logarithmic factor of log p, are slower than those for concave regularization methods. This gives an interesting phase diagram of how the performance of regularization methods in the thresholded parameter space evolves with the dimensionality and the penalty function.

31 Oracle Risk Inequalities of Global Minimizer Theorem 3 (Oracle risk inequalities). Assume that the conditions of Theorem 2 hold and the fourth moments of the errors, E ε_i^4, are uniformly bounded. Then any global minimizer β̂ satisfies that: (a) (Sign risk). E{FS(β̂)} = [p_λ(τ)]^{-1}{ [ ‖p_λ(β_0)‖_1 + s λ² ] O(n^{-c_1}) + O(p^{-c_1/2} κ_c) }; (b) (Estimation and prediction risks). If the penalty function further satisfies p'_λ(τ) = O{√((log n)/n)}, then we have, for each q ∈ [1, 2], E ‖β̂ - β_0‖_q^q ≤ C s [(log n)/n]^{q/2}, E ‖β̂ - β_0‖_∞ ≤ C γ_n √((log n)/n), and E{n^{-1} D(β̂)} ≤ C s (log n)/n, where C is some positive constant.

32 Continued E{FS(β̂)} converges to zero at a polynomial rate in n. Consistent with the risk bounds O{s(log n)/n} of the regularized estimators under the L_2-loss in the wavelet setting with orthogonal design (Antoniadis and Fan, 2001). No additional cost in the risk bounds for generalizing to the ultra-high dimensional nonlinear model setting of GLMs.

33 Sampling Properties of Computable Solutions How about a computable solution produced by an algorithm, which may not be the global minimizer? A computable solution produced by any algorithm can share the same nice asymptotic properties as any global minimizer when the maximum correlation between the covariates and the residual vector y - μ(Xβ̂) is of a smaller order than the threshold τ, where μ(θ) = (b'(θ_1), ..., b'(θ_n))^T.

34 Continued Theorem 4. Let β̂ ∈ B_{τ,c} be a computable solution to the minimization problem produced by any algorithm such that it is the global minimizer when constrained to the subspace given by supp(β̂), and let η_n = ‖n^{-1} X^T [y - μ(X β̂)]‖_∞. Assume in addition that, if the model is nonlinear, there exists some positive constant c_4 such that ‖n^{-1} X_α^T [μ(Xβ) - μ(Xβ_0)]‖_2 ≥ c_4 ‖β - β_0‖_2 for any β ∈ B_{τ,c} and α = supp(β) ∪ supp(β_0). If η_n + λ = o(τ) and min_{1≤j≤s} |β_{0,j}| > c_5 s^{1/2} (η_n + λ) with c_5 some sufficiently large positive constant, then β̂ enjoys the same asymptotic properties as any global minimizer in Theorems 1-3 under the same conditions therein.
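
In the linear model, the quantity η_n in Theorem 4 is just the maximum absolute correlation between the covariates and the residual vector, which is easy to monitor for any fitted solution. A small simulated sketch (the oracle-style refit and the constant in τ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, s = 100, 500, 5
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)
beta0 = np.zeros(p)
beta0[:s] = [1.0, -0.8, 0.6, -1.1, 0.9]
y = X @ beta0 + 0.4 * rng.standard_normal(n)

# Oracle-style refit on the true support, standing in for a computable solution
beta_hat = np.zeros(p)
beta_hat[:s] = np.linalg.lstsq(X[:, :s], y, rcond=None)[0]

eta_n = np.max(np.abs(X.T @ (y - X @ beta_hat)) / n)   # eta_n = ||n^{-1} X^T (y - X beta_hat)||_inf
tau = np.sqrt(np.log(n)) * np.sqrt(np.log(p) / n)      # tau = c6 (log n)^{1/2} sqrt((log p)/n), c6 = 1
print(f"eta_n = {eta_n:.3f}, tau = {tau:.3f}")
```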

35 Implementation

36 Implementation Lasso-type methods: LARS algorithm (Efron et al., 2004); nonconcave penalized likelihood methods: LQA algorithm (Fan and Li, 2001) and LLA algorithm (Zou and Li, 2008). An alternative algorithm for solving large-scale problems: coordinate optimization (Friedman et al., 2007; Wu and Lange, 2008). ICA algorithm (Fan and Lv, 2011): implementing nonconcave penalized likelihood methods with second-order quadratic approximation of the likelihood function and coordinate optimization (see, e.g., Lin and Lv (2013) for convergence analysis).

37 ICA Algorithm For each coordinate within each iteration, we solve the univariate penalized least-squares problem with the corresponding quadratic approximation of the likelihood function, and update this coordinate only when the global minimizer has magnitude above the given threshold τ. The thresholded parameter space naturally puts a constraint on each component. Thresholding also induces additional sparsity of the regularized estimate, making the algorithm converge faster.
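
A minimal sketch of this idea for penalized least squares is given below: cyclic coordinate updates where each univariate minimizer (here with the L_1 penalty, i.e., soft thresholding) is retained only if its magnitude is at least τ and is set to zero otherwise. This is an illustration of the thresholded coordinate update, not the ICA implementation of Fan and Lv (2011).

```python
import numpy as np

def soft(z, lam):
    # soft-thresholding operator, the univariate L1-penalized least-squares solution
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def thresholded_cd(X, y, lam, tau, n_iter=100):
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                         # current residual y - X beta
    col_sq = (X ** 2).sum(axis=0) / n    # n^{-1} ||X_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]       # remove coordinate j from the fit
            z = X[:, j] @ r / n
            bj = soft(z, lam) / col_sq[j]
            beta[j] = bj if abs(bj) >= tau else 0.0   # keep only if magnitude >= tau
            r -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(4)
n, p = 100, 300
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)
beta0 = np.zeros(p)
beta0[:3] = [1.2, -0.9, 0.7]
y = X @ beta0 + 0.3 * rng.standard_normal(n)
beta_hat = thresholded_cd(X, y, lam=0.1, tau=0.2)
print(np.nonzero(beta_hat)[0], np.round(beta_hat[beta_hat != 0], 2))
```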

38 Algorithm Stability Assume p_λ(t) has maximum concavity ρ(p_λ) = sup_{0 < t_1 < t_2 < ∞} { -[p'_λ(t_2) - p'_λ(t_1)]/(t_2 - t_1) } < c c_2, with constants c and c_2. This ensures that Q_n(β) is strictly convex on a union of coordinate subspaces {β ∈ R^p : ‖β‖_0 < κ_c}, which is key to the stability of the sparse solution found by any algorithm. This condition holds for many concave penalties. For example, the L_1-penalty p_λ(t) = λt in Lasso has maximum concavity 0, SCAD has ρ(p_λ) = (a - 1)^{-1}, and SICA with p_λ(t; a) = λ(a + 1)t/(a + t) has maximum concavity 2λ(a^{-1} + a^{-2}).
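
The stated maximum concavity of SICA can be verified numerically from the definition above by taking the supremum of the difference quotients of p'_λ over a fine grid; a small sketch with illustrative values of λ and a:

```python
import numpy as np

lam, a = 1.0, 0.5
t = np.linspace(1e-6, 10, 200001)
p_prime = lam * (a + 1) * a / (a + t) ** 2           # derivative of the SICA penalty
rho_numeric = np.max(-np.diff(p_prime) / np.diff(t)) # sup of -[p'(t2) - p'(t1)]/(t2 - t1)
print(rho_numeric, 2 * lam * (1 / a + 1 / a ** 2))   # both close to 12 for these values
```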

39 Numerical Studies

40 Simulation setting GLMs with p ≫ n. 100 simulations with data generated from the linear model, logistic model, and Poisson model, respectively; the mean response vector depends on Xβ_0. The rows of X were sampled as i.i.d. copies from N(0, Σ) with Σ = (r^{|j-k|})_{1≤j,k≤p} for some number r. In the linear model, the error ε ~ N(0, σ² I_n) independent of X, with n = 100, σ = 0.4, and β_0 = (1, -0.5, 0.7, -1.2, -0.9, 0.5, 0.55, 0, ..., 0)^T. Considered (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25).
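
A minimal data-generating sketch for this linear-model setting (the sign pattern of β_0 follows the slide; everything else is as described above):

```python
import numpy as np

def simulate_linear(n=100, p=1000, r=0.25, sigma=0.4, seed=0):
    # Rows of X are i.i.d. N(0, Sigma) with Sigma = (r^{|j-k|}); Gaussian errors.
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = r ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta0 = np.zeros(p)
    beta0[:7] = [1, -0.5, 0.7, -1.2, -0.9, 0.5, 0.55]
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0

X, y, beta0 = simulate_linear()
print(X.shape, y.shape, np.nonzero(beta0)[0])
```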

41 Continued Compared the Lasso, SCAD, and SICA in the thresholded parameter space (Lasso_t, SCAD_t, SICA_t), as well as SCAD itself, with the oracle procedure. Choose τ = c_6 (log n)^{1/2} √((log p)/n) for some positive constant c_6. Since the main purpose of our simulation study is to justify the theoretical results, we use an independent validation set, with size equal to the sample size, to select tuning parameters.

42 Continued Performance measures: Prediction error (PE): E(Y - x^T β̂)² with β̂ an estimate and (x^T, Y) an independent observation, calculated based on an independent test sample of size 10,000. L_q-losses: ‖β̂ - β_0‖_q with q = 2, 1, and ∞. FP: number of falsely selected noise covariates. FN: number of missed true covariates. Model selection consistency probability based on 100 simulations. Estimate σ̂ of the error standard deviation σ.
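
These measures are straightforward to compute for any estimate; a minimal sketch for the linear model (the small random test data here only make the example self-contained):

```python
import numpy as np

def performance(beta_hat, beta0, X_test, y_test):
    # PE, L_q-losses, FP, and FN for an estimate beta_hat against the truth beta0
    diff = beta_hat - beta0
    return {
        "PE": float(np.mean((y_test - X_test @ beta_hat) ** 2)),
        "L2": float(np.linalg.norm(diff, 2)),
        "L1": float(np.linalg.norm(diff, 1)),
        "Linf": float(np.linalg.norm(diff, np.inf)),
        "FP": int(np.sum((beta_hat != 0) & (beta0 == 0))),
        "FN": int(np.sum((beta_hat == 0) & (beta0 != 0))),
    }

rng = np.random.default_rng(5)
X_test = rng.standard_normal((50, 10))
beta0 = np.array([1, -0.5, 0.7] + [0.0] * 7)
y_test = X_test @ beta0 + 0.4 * rng.standard_normal(50)
beta_hat = beta0.copy()
beta_hat[0], beta_hat[5] = 0.9, 0.1       # one mis-estimated and one spurious coefficient
print(performance(beta_hat, beta0, X_test, y_test))
```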

43 Simulation results: linear model [Figure 1: Boxplots of the PE, L_2-loss, FP, and FN over 100 simulations for all methods (Lasso_t, SCAD, SCAD_t, SICA_t, Oracle) in Section 4.2.1, with (p, r) = (5000, 0.25); the x-axis represents the different methods.]

44 Continued Table 1: The means and standard errors (in parentheses) of various performance measures, as well as the estimated error standard deviation, for all methods in Section 4.2.1; settings I, II, and III refer to the cases (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25), respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE (×0.1), L_2-loss (×0.1), L_1-loss (×0.1), L_∞-loss (×0.01), FP, FN, and σ̂ (×0.1).]

45 Continued Table 2: Model selection consistency probabilities of all methods in Section 4.2.1. [Rows: settings (p, r) = (1000, 0.25), (1000, 0.5), (5000, 0.25); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

46 Continued Table 3: The means and standard errors (in parentheses) of various performance measures, as well as the estimated error standard deviation, for all methods in Section 4.2.1 with (p, r) = (5000, 0.5); settings I, II, and III refer to the cases n = 100, 200, and 400, respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE (×0.1), L_2-loss (×0.1), L_1-loss (×0.1), L_∞-loss (×0.01), FP, FN, and σ̂ (×0.1).]

47 Continued Table 4: Model selection consistency probabilities of all methods in Section 4.2.1 with (p, r) = (5000, 0.5). [Rows indexed by n (100, 200, 400); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

48 Simulation results: logistic model In the logistic model, the response vector y was sampled from the Bernoulli distribution with success probability vector (e^{θ_1}/(1 + e^{θ_1}), ..., e^{θ_n}/(1 + e^{θ_n}))^T, with (θ_1, ..., θ_n)^T = Xβ_0. n = 200 and β_0 = (2, 0, -2.3, 0, 2.8, 0, -2.2, 0, 2.5, 0, ..., 0)^T. Prediction error: E{Y - exp(x^T β̂)/[1 + exp(x^T β̂)]}².
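
A corresponding data-generating and prediction-error sketch for the logistic setting (sign pattern of β_0 follows the slide; p and r are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, r = 200, 1000, 0.25
idx = np.arange(p)
Sigma = r ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta0 = np.zeros(p)
beta0[[0, 2, 4, 6, 8]] = [2, -2.3, 2.8, -2.2, 2.5]
prob = 1 / (1 + np.exp(-X @ beta0))
y = rng.binomial(1, prob)

def logistic_pe(beta_hat, X_test, y_test):
    # PE = mean of {Y - exp(x^T beta)/[1 + exp(x^T beta)]}^2 over the test sample
    p_hat = 1 / (1 + np.exp(-X_test @ beta_hat))
    return np.mean((y_test - p_hat) ** 2)

print(logistic_pe(beta0, X, y))   # PE of the true coefficient vector itself
```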

49 Continued [Figure 2: Boxplots of the PE, L_2-loss, FP, and FN over 100 simulations for all methods (Lasso_t, SCAD, SCAD_t, SICA_t, Oracle) in Section 4.2.2, with p = ; the x-axis represents the different methods.]

50 Continued Table 5: The means and standard errors (in parentheses) of various prediction and variable selection performance measures for all methods in Section 4.2.2; settings I, II, and III refer to the cases (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25), respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE (×0.01), L_2-loss, L_1-loss, L_∞-loss (×0.1), FP, and FN.]

51 Continued Table 6: Model selection consistency probabilities of all methods in Section 4.2.2. [Rows: settings (p, r) = (1000, 0.25), (1000, 0.5), (5000, 0.25); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

52 Simulation results: Poisson model In the Poisson model, the response vector y was sampled from the Poisson distribution with mean vector (e^{θ_1}, ..., e^{θ_n})^T, with (θ_1, ..., θ_n)^T = Xβ_0. n = 200 and β_0 = (1, -0.9, 0.8, -1.1, 0.6, 0, ..., 0)^T. Prediction error: E[Y - exp(x^T β̂)]².
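
And the analogous sketch for the Poisson setting (signs of β_0 follow the slide; p and r are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, r = 200, 1000, 0.25
idx = np.arange(p)
Sigma = r ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta0 = np.zeros(p)
beta0[:5] = [1, -0.9, 0.8, -1.1, 0.6]
y = rng.poisson(np.exp(X @ beta0))

def poisson_pe(beta_hat, X_test, y_test):
    # PE = mean of [Y - exp(x^T beta)]^2 over the test sample
    return np.mean((y_test - np.exp(X_test @ beta_hat)) ** 2)

print(poisson_pe(beta0, X, y))
```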

53 Continued [Figure 3: Boxplots of the PE, L_2-loss, FP, and FN over 100 simulations for all methods (Lasso_t, SCAD, SCAD_t, SICA_t, Oracle) in Section 4.2.3, with p = ; the x-axis represents the different methods.]

54 Continued Table 7: The 5% trimmed means and standard errors (in parentheses) of various prediction and variable selection performance measures for all methods in Section 4.2.3; settings I, II, and III refer to the cases (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25), respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE, L_2-loss (×0.01), L_1-loss (×0.1), L_∞-loss (×0.01), FP, and FN.]

55 Continued Table 8: Model selection consistency probabilities of all methods in Section 4.2.3. [Rows: settings (p, r) = (1000, 0.25), (1000, 0.5), (5000, 0.25); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

56 Application to genomic data The prostate cancer data set in Singh et al. (2002) consists of 136 patient samples, 77 from the prostate tumor group and 59 from the normal group, with each patient having gene expression measurements for 12,600 genes. Randomly split all samples into a training set of 52 samples from the cancer class and 50 samples from the normal class, and a test set of 25 samples from the cancer class and 9 samples from the normal class. For each splitting, we fit a logistic regression model to the training data with the regularization methods, and then calculated the classification error using the test data. We repeated the random splitting 50 times, and tuning parameters were selected by fivefold CV.
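
The random-splitting evaluation can be sketched with any off-the-shelf penalized logistic regression; below, an L_1-penalized fit with fivefold cross-validation from scikit-learn stands in for the talk's methods, and the data are faked so the snippet runs end to end (the real study uses the 12,600-gene expression matrix and class-stratified 102/34 splits).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(8)
X = rng.standard_normal((136, 500))          # placeholder for the gene expression matrix
y = (rng.random(136) < 0.57).astype(int)     # placeholder 0/1 tumor labels

errors = []
for split in range(10):                      # 50 splits in the talk; 10 here to keep it quick
    idx = rng.permutation(len(y))
    train, test = idx[:102], idx[102:]       # 102 training and 34 test samples (plain random split)
    clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear")
    clf.fit(X[train], y[train])
    errors.append(1 - accuracy_score(y[test], clf.predict(X[test])))
print(f"mean classification error: {np.mean(errors):.3f} "
      f"(SE {np.std(errors) / np.sqrt(len(errors)):.3f})")
```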

57 Continued Table 9: The means and standard errors of classification errors by different methods over 50 random splittings of the prostate cancer data in Section 4.3. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t; rows: mean and standard error.]

58 Continued Table 10: Selection probabilities of the most frequently selected genes, with number up to the median model size by each method, across 50 random splittings of the prostate cancer data in Section 4.3. [Two blocks of columns: Gene ID, Lasso_t, SCAD, SCAD_t, SICA_t.]

59 Conclusions We have studied the asymptotic equivalence of two popular classes of regularization methods, convex ones and concave ones, in high-dimensional GLMs. Oracle inequalities, as well as stronger oracle risk inequalities, of the global minimizer and sampling properties of computable solutions have been established for regularization methods to characterize their connections and differences. When p = O(n^a), all regularization methods under consideration, including the Lasso, are asymptotically equivalent, while when log p = O(n^a), concave methods are asymptotically equivalent to each other but have faster convergence rates than the Lasso.


More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

A Constructive Approach to L 0 Penalized Regression

A Constructive Approach to L 0 Penalized Regression Journal of Machine Learning Research 9 (208) -37 Submitted 4/7; Revised 6/8; Published 8/8 A Constructive Approach to L 0 Penalized Regression Jian Huang Department of Applied Mathematics The Hong Kong

More information

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators

Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators Electronic Journal of Statistics ISSN: 935-7524 arxiv: arxiv:503.0388 Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators Yuchen Zhang, Martin J. Wainwright

More information

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS

The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS A Dissertation in Statistics by Ye Yu c 2015 Ye Yu Submitted

More information

Ultra High Dimensional Variable Selection with Endogenous Variables

Ultra High Dimensional Variable Selection with Endogenous Variables 1 / 39 Ultra High Dimensional Variable Selection with Endogenous Variables Yuan Liao Princeton University Joint work with Jianqing Fan Job Market Talk January, 2012 2 / 39 Outline 1 Examples of Ultra High

More information