Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space

1 Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space. Jinchi Lv, Data Sciences and Operations Department, Marshall School of Business, University of Southern California. In collaboration with Yingying Fan.

2 Outline: Motivation; Regularization methods in thresholded parameter space; Asymptotic equivalence of regularization methods; Implementation; Numerical studies

3 Big Data Problems Commonly encountered in diverse fields ranging from genomics and health sciences to economics, finance, and machine learning. DNA microarrays are used to produce expression measurements of tens of thousands of genes (about 20,000 genes in the human genome), or to identify hundreds of thousands of single nucleotide polymorphisms (SNPs) (500,000 to 1,000,000 SNPs in genome-wide association studies). Typical sample size: tens to hundreds for gene expression data, and hundreds to thousands for SNP data. MRI and fMRI data. High-dimensional data: p >> n or p comparable to n. Big data: both n and p large.

4 Continued A heat map displaying microarray data (rows for subjects and columns for genes), with colors ranging from bright green (negative, under-expressed) to bright red (positive, over-expressed)

5 Challenges and Impacts of High Dimensionality Effectively identifying important variables and efficiently estimating their effects are very challenging when p >> n. Classical methods: best subset selection with AIC and BIC (L_0-regularization) has NP-hard complexity. High collinearity, spurious correlation, and noise accumulation (Fan and Lv, 2008 & 2010). Concentration phenomenon of high dimension, low sample size data (Hall, Marron and Neeman, 2005). Impacts of dimensionality: probabilistic and non-probabilistic views (Lv, 2013).

6 High-Dimensional Sparse Modeling A common strategy for estimating covariate effects is fitting a parametric model of p dimensions. Issues of overfitting and model identifiability. High-dimensional sparse modeling using regularization methods: minimize the empirical risk function + Σ_{j=1}^p p_λ(|β_j|). Convex penalties: Lasso (Tibshirani, 1996), Adaptive Lasso (Zou, 2006), Group Lasso (Yuan and Lin, 2006), ... Concave penalties: SCAD (Fan, 1997; Fan and Li, 2001), MCP (Zhang, 2010), SICA (Lv and Fan, 2009), ...

7 Continued Related L_1-regularization method: the Dantzig selector (Candes and Tao, 2007). Approximately equivalent to the Lasso (Bickel, Ritov and Tsybakov, 2009). Exactly equivalent to the Lasso (Meinshausen, Rocha and Yu, 2007; James, Radchenko and Lv, 2009). Asymptotic equivalence of hard-thresholding and L_0-regularization (Zheng, Fan and Lv, 2014). Understanding the connections and differences of different regularization methods is important both theoretically and empirically.

8 Penalty Functions [Figure: plots of the penalty function p_λ(θ) against θ for the SCAD, MCP, hard-thresholding, and soft-thresholding (L_1) penalties, shown in two panels.]
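
For concreteness, here is a minimal Python sketch of the penalty families plotted above; the SCAD, MCP, and SICA formulas follow their standard definitions (Fan and Li, 2001; Zhang, 2010; Lv and Fan, 2009), and the default values of the shape parameter a are illustrative choices rather than those used in the talk.

```python
import numpy as np

def soft_penalty(t, lam):
    # L1 (soft-thresholding / Lasso) penalty: p_lambda(t) = lambda * t
    return lam * t

def scad_penalty(t, lam, a=3.7):
    # SCAD penalty of Fan and Li (2001), defined piecewise in t >= 0
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2),
    )

def mcp_penalty(t, lam, a=3.0):
    # MCP penalty of Zhang (2010)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)

def sica_penalty(t, lam, a=0.1):
    # SICA penalty of Lv and Fan (2009): lambda * (a + 1) * t / (a + t)
    return lam * (a + 1) * t / (a + t)

t = np.linspace(0, 3, 7)
print(scad_penalty(t, lam=1.0), mcp_penalty(t, lam=1.0), sica_penalty(t, lam=1.0), sep="\n")
```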

9 A Motivating Example Linear model: 100 simulations with (s, n, p) = (7, 60, 1000) and β_0 = (1, -0.5, 0.7, -1.2, -0.9, 0.3, 0.55, 0^T)^T; rows of X sampled as i.i.d. copies from N(0, I_p) and, independently, ε ~ N(0, 0.3² I_n). Probability of successfully recovering supp(β_0): Lasso 1%, SCAD 77%.

10 Continued [Figure: an illustrative solution path for the Lasso (left panel) and SCAD (right panel).]

11 Continued Will both methods share any similarity when viewed in an appropriate way? Success probability of recovering supp(β_0) in the thresholded parameter space: Lasso_t 92%, SCAD_t 95%. [Figure: an illustrative solution path for Lasso_t and SCAD_t.]

12 Questions of Interest What are the connections and differences of all regularization methods? Will they be essentially the same under certain measures? Is there any interesting phase transition phenomenon?

13 Regularization Methods in Thresholded Parameter Space

14 Model Setup (x_i, y_i), i = 1, ..., n: n independent observations from (x, y) in the generalized linear model (GLM) linking a p-dimensional predictor vector x to a scalar response Y. With the canonical link, the conditional distribution of Y given x has density f(y; θ, φ) = exp{yθ - b(θ) + c(y, φ)}, where θ = x^T β with β a p-dimensional regression coefficient vector, b(·) and c(·,·) are known functions, and φ is the dispersion parameter. β_0 = (β_{0,1}, ..., β_{0,p})^T is sparse with many zero components. We allow log p = O(n^a) for some 0 < a < 1.

15 Regularization Methods Log-likelihood function l_n(β) = Σ_{i=1}^n { y_i x_i^T β - b(x_i^T β) + c(y_i, φ) }. Penalized negative log-likelihood Q_n(β) = -n^{-1}{ y^T Xβ - 1^T b(Xβ) } + ‖p_λ(β)‖_1, where y = (y_1, ..., y_n)^T, X = (x_1, ..., x_n)^T, b(θ) = (b(θ_1), ..., b(θ_n))^T with θ = (θ_1, ..., θ_n)^T, and ‖p_λ(β)‖_1 = Σ_{j=1}^p p_λ(|β_j|). Each column of X is rescaled to have L_2-norm n^{1/2}. Investigated in, e.g., Fan and Li (2001), Fan and Peng (2004), van de Geer (2008), Fan and Lv (2011), ...
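
The penalized objective above can be written down directly; the following is a minimal sketch for the Bernoulli (logistic) GLM, where b(θ) = log(1 + e^θ), with the L_1 penalty plugged in. The function and variable names are illustrative, not from the talk.

```python
import numpy as np

def b_logistic(theta):
    # cumulant function of the Bernoulli GLM: b(theta) = log(1 + exp(theta))
    return np.logaddexp(0.0, theta)

def penalized_neg_loglik(beta, X, y, lam, penalty):
    # Q_n(beta) = -n^{-1} { y^T X beta - 1^T b(X beta) } + sum_j p_lambda(|beta_j|)
    n = X.shape[0]
    theta = X @ beta
    return -(y @ theta - b_logistic(theta).sum()) / n + penalty(np.abs(beta), lam).sum()

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)      # columns rescaled to L2-norm n^{1/2}
beta0 = np.zeros(p)
beta0[:3] = [1.0, -0.8, 0.6]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))
l1 = lambda t, lam: lam * t                         # L1 penalty p_lambda(t) = lambda * t
print(penalized_neg_loglik(beta0, X, y, lam=0.1, penalty=l1))
```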

16 Robust Spark The robust spark κ_c of the n × p design matrix X is defined as the smallest possible positive integer such that there exists an n × κ_c submatrix of n^{-1/2} X having a singular value less than a given positive constant c (Zheng, Fan and Lv, 2014). Bounding the sparse model size to control collinearity and ensure model identifiability and stability. This concept generalizes the spark introduced in Donoho and Elad (2003). κ_c ≤ n + 1. As c → 0+, κ_c approaches the spark.
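
The robust spark has no efficient general algorithm; the brute-force sketch below simply scans submatrices of n^{-1/2} X in increasing size and reports the first size whose smallest singular value drops below c. It is only feasible for very small p and is meant to make the definition concrete.

```python
import numpy as np
from itertools import combinations

def robust_spark(X, c):
    # Smallest k such that some n x k submatrix of n^{-1/2} X has a singular value < c.
    # Exponential cost in p, so this is an illustration only.
    n, p = X.shape
    Xs = X / np.sqrt(n)
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            smin = np.linalg.svd(Xs[:, cols], compute_uv=False).min()
            if smin < c:
                return k
    return p + 1  # no submatrix is nearly singular at level c

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 6))
print(robust_spark(X, c=0.5))
```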

17 Continued Robust spark can be some large number diverging with n. The order of κ_c when X is generated from a Gaussian distribution: Proposition 1. Assume log p = o(n) and that the rows of the n × p random design matrix X are independent and identically distributed (i.i.d.) as N(0, Σ), where Σ has smallest eigenvalue bounded from below by some positive constant. Then there exist positive constants c and c̃ such that, with asymptotic probability one, κ_c ≥ c̃ n/(log p).

18 Thresholded Parameter Space We introduce a thresholded parameter space B_{τ,c} = {β ∈ R^p : ‖β‖_0 < κ_c/2 and, for each j, β_j = 0 or |β_j| ≥ τ}, where β = (β_1, ..., β_p)^T and τ is some positive threshold on the parameter magnitude. τ is key to distinguishing between important covariates and noise covariates for the purpose of variable selection. τ typically needs to satisfy τ √(n/(log p)) → ∞ as n → ∞. Motivated by best subset regression with L_0-regularization and oracle risk inequalities under the prediction loss (Barron, Birge and Massart, 1999).

19 Artificial or Natural? Hard-thresholding property: Proposition 2. For the L_0-penalty p_λ(t) = λ 1_{t ≠ 0}, the global minimizer β̂ = (β̂_1, ..., β̂_p)^T of the regularization problem over R^p satisfies that each component β̂_j is either 0 or has magnitude larger than some positive threshold. This property is shared by many other penalties such as the hard-thresholding and SICA penalties. If some covariates have weak effects, we can keep them out of the model to improve prediction accuracy with reduced estimation variability. Weak signals generally have difficulty standing out from noise variables due to the impact of high dimensionality.

20 Asymptotic Equivalence of Regularization Methods

21 Two Key Events Consider a universal λ = c_0 √((log p)/n), with c_0 > 0 and p understood implicitly as n ∨ p. Two key events: E = { ‖n^{-1} X^T ε‖_∞ ≤ λ/2 } and E_0 = { ‖n^{-1} X_{α_0}^T ε‖_∞ ≤ c_0 √((log n)/n) }, with X_α a submatrix of X consisting of the columns in α, α_0 = supp(β_0), and ε = (ε_1, ..., ε_n)^T = y - E(y).
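
A quick Monte Carlo check makes the first event concrete for Gaussian noise: with λ = c_0 √((log p)/n), the bound ‖n^{-1} X^T ε‖_∞ ≤ λ/2 holds in nearly every replication. The constants and design below are illustrative assumptions, not the talk's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, c0, sigma = 100, 1000, 2.0, 0.4
lam = c0 * np.sqrt(np.log(p) / n)
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)   # columns rescaled to L2-norm n^{1/2}

hits, reps = 0, 200
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    hits += int(np.max(np.abs(X.T @ eps) / n) <= lam / 2)   # event E
print(f"P(E) estimated over {reps} replications: {hits / reps:.2f}")
```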

22 Technical Conditions Condition 1 (Error tail distribution): P(E^c) = O(p^{-c_1}) and P(E_0^c) = O(n^{-c_1}) for some positive constant c_1 that can be sufficiently large for large enough c_0. Condition 2 (Bounded variance): b(θ) satisfies c_2 ≤ b''(θ) ≤ c_2^{-1} in its domain, where c_2 is some positive constant. Condition 3 (Concave penalty function): p_λ(t) is increasing and concave in t ∈ [0, ∞) with p_λ(0) = 0, and is differentiable with p'_λ(0+) = c_3 λ for some positive constant c_3 (Lv and Fan, 2009; Fan and Lv, 2011). A wide class of penalties, including the L_1-penalty in Lasso, SCAD, MCP, and SICA, satisfy Condition 3.

23 Continued Condition 4 (Ultra-high dimensionality): log p = O(n^a) for some constant a ∈ (0, 1). Condition 5 (True parameter vector): s = o(n^{1-a}) and there exists a constant c > 0 such that the robust spark κ_c > 2s. Moreover, min_{1≤j≤s} |β_{0,j}| ≫ √((log p)/n).

24 Oracle Inequalities of Global Minimizer Any global minimizer β̂ = argmin_{β ∈ B_{τ,c}} Q_n(β). Oracle inequalities were established in Bickel, Ritov and Tsybakov (2009) to study the asymptotic equivalence of the Lasso and the Dantzig selector. Total number of falsely discovered signs FS(β̂) = #{ j : sgn(β̂_j) ≠ sgn(β_{0,j}), 1 ≤ j ≤ p }, where β̂ = (β̂_1, ..., β̂_p)^T.
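
Counting falsely discovered signs is a one-liner; a minimal sketch (names illustrative):

```python
import numpy as np

def false_signs(beta_hat, beta0):
    # FS(beta_hat) = #{ j : sgn(beta_hat_j) != sgn(beta_{0,j}) }
    return int(np.sum(np.sign(beta_hat) != np.sign(beta0)))

print(false_signs(np.array([0.9, 0.0, -0.2, 0.4]), np.array([1.0, 0.0, 0.3, 0.0])))  # -> 2
```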

25 Continued Theorem 1 (Oracle inequalities). Assume that Conditions 1-5 hold and τ is chosen such that τ < min_{1≤j≤s} |β_{0,j}| and λ = c_0 √((log p)/n) = o(τ). Then the global minimizer β̂ exists, and any such global minimizer satisfies that, with probability at least 1 - O(p^{-c_1}), it holds simultaneously that: (a) (False signs). FS(β̂) ≤ C s (λ²/τ²) / (1 - C λ²/τ²); (b) (Estimation losses). ‖β̂ - β_0‖_q ≤ C λ s^{1/q} (1 - C λ²/τ²)^{-1/q} for each q ∈ [1, 2] and ‖β̂ - β_0‖_∞ ≤ C λ s^{1/2} (1 - C λ²/τ²)^{-1/2}; (c) (Prediction loss). n^{-1/2} ‖X(β̂ - β_0)‖_2 ≤ C λ s^{1/2} (1 - C λ²/τ²)^{-1/2}, where C is some positive constant.

26 Continued Results hold uniformly over the set of all possible global minimizers. The constant c_1 in the probability bound can be chosen arbitrarily large, affecting only C. FS(β̂) = o(s) since λ = o(τ), while ‖β̂_Lasso‖_0 = O(φ_max s) with φ_max the largest eigenvalue of n^{-1} X^T X (Bickel, Ritov and Tsybakov, 2009). For each q ∈ [1, 2], ‖β̂ - β_0‖_q = O{ s^{1/q} √((log p)/n) } and n^{-1/2} ‖X(β̂ - β_0)‖_2 = O(√(s (log p)/n)); these convergence rates are consistent with those in Bickel, Ritov and Tsybakov (2009) for the Lasso.

27 Continued Theorem 2 (Sign consistency and oracle inequalities). Assume that the conditions of Theorem 1 hold with min_{1≤j≤s} |β_{0,j}| ≥ 2τ, λ = c_0 √((log p)/n) = o(s^{-1/2} τ), and γ_n = o{ τ √(n/(s log n)) }. Then any global minimizer β̂ satisfies that, with probability at least 1 - O(n^{-c_1}), it holds simultaneously that: (a) (Sign consistency). sgn(β̂) = sgn(β_0); (b) (Estimation and prediction losses). If the penalty function further satisfies p'_λ(τ) = O{√((log n)/n)}, then we have, for each q ∈ [1, 2], ‖β̂ - β_0‖_q ≤ C s^{1/q} √((log n)/n), ‖β̂ - β_0‖_∞ ≤ C γ_n √((log n)/n), and n^{-1} D(β̂) ≤ C s (log n)/n, where γ_n is a constant characterizing the behavior of {n^{-1} X_{α_0}^T H(β) X_{α_0}}^{-1} (with H(β) the diagonal GLM weight matrix) in a small neighborhood of β_0, D(β̂) is the Kullback-Leibler divergence, and C is some positive constant.

28 Continued In the linear model, γ_n = ‖(n^{-1} X_{α_0}^T X_{α_0})^{-1}‖ and γ̃_n = sup_{α ⊂ {s+1, ..., p}, |α| ≤ s} ‖n^{-1} X_{α_0}^T X_α‖; moreover, γ_n ≤ s^{1/2} ‖(n^{-1} X_{α_0}^T X_{α_0})^{-1}‖_2 ≤ c^{-1} s^{1/2}. When all true covariates are orthogonal to each other, γ_n = 1 and ‖β̂ - β_0‖_∞ ≤ C √((log n)/n), within a logarithmic factor log n of the oracle rate. The condition p'_λ(τ) = O{√((log n)/n)} can be easily satisfied by concave penalties such as SCAD and SICA, so their convergence rates are improved, with log n in place of log p.

29 Phase Transition Phenomenon Combining Theorems 1 and 2 shows that for p = O(n^a), the Lasso and concave regularization methods are asymptotically equivalent, having the same convergence rates in the oracle inequalities, with a logarithmic factor of log n. For log p = O(n^a), concave regularization methods are asymptotically equivalent and still enjoy the same convergence rates in the oracle inequalities, with a logarithmic factor of log n. For the Lasso, the condition p'_λ(τ) = O{√((log n)/n)} and the choice of λ = c_0 √((log p)/n) are incompatible with each other in this case.

30 Continued In the ultra-high dimensional case, the convergence rates for the Lasso, which involve a logarithmic factor of log p, are slower than those for concave regularization methods. This gives an interesting phase diagram of how the performance of regularization methods in the thresholded parameter space evolves with the dimensionality and the penalty function.

31 Oracle Risk Inequalities of Global Minimizer Theorem 3 (Oracle risk inequalities). Assume that the conditions of Theorem 2 hold and the fourth moments of the errors, E ε_i^4, are uniformly bounded. Then any global minimizer β̂ satisfies that: (a) (Sign risk). E{FS(β̂)} = [p_λ(τ)]^{-1}{ [ ‖p_λ(β_0)‖_1 + s λ² ] O(n^{-c_1}) + O(p^{-c_1/2} κ_c) }; (b) (Estimation and prediction risks). If the penalty function further satisfies p'_λ(τ) = O{√((log n)/n)}, then we have, for each q ∈ [1, 2], E ‖β̂ - β_0‖_q^q ≤ C s [(log n)/n]^{q/2}, E ‖β̂ - β_0‖_∞ ≤ C γ_n √((log n)/n), and E{n^{-1} D(β̂)} ≤ C s (log n)/n, where C is some positive constant.

32 Continued E{FS(β̂)} converges to zero at a polynomial rate in n. Consistent with the risk bounds O{s(log n)/n} of the regularized estimators under the L_2-loss in the wavelet setting with orthogonal design (Antoniadis and Fan, 2001). No additional cost in the risk bounds for generalizing to the ultra-high dimensional nonlinear model setting of GLMs.

33 Sampling Properties of Computable Solutions How about a computable solution produced by an algorithm, which may not be the global minimizer? A computable solution produced by any algorithm can share the same nice asymptotic properties as any global minimizer when the maximum correlation between the covariates and the residual vector y - μ(Xβ̂) is of a smaller order than the threshold τ, where μ(θ) = (b'(θ_1), ..., b'(θ_n))^T.

34 Continued Theorem 4. Let β̂ ∈ B_{τ,c} be a computable solution to the minimization problem produced by any algorithm such that it is the global minimizer when constrained to the subspace given by supp(β̂), and let η_n = ‖n^{-1} X^T [y - μ(X β̂)]‖_∞. Assume in addition that, if the model is nonlinear, there exists some positive constant c_4 such that ‖n^{-1} X_α^T [μ(Xβ) - μ(Xβ_0)]‖_2 ≥ c_4 ‖β - β_0‖_2 for any β ∈ B_{τ,c} and α = supp(β) ∪ supp(β_0). If η_n + λ = o(τ) and min_{1≤j≤s} |β_{0,j}| > c_5 s^{1/2} (η_n + λ) with c_5 some sufficiently large positive constant, then β̂ enjoys the same asymptotic properties as any global minimizer in Theorems 1-3 under the same conditions therein.
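
In the linear model, the quantity η_n in Theorem 4 is just the maximum absolute correlation between the covariates and the residual vector, which is easy to monitor for any fitted solution. A small simulated sketch (the oracle-style refit and the constant in τ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, s = 100, 500, 5
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)
beta0 = np.zeros(p)
beta0[:s] = [1.0, -0.8, 0.6, -1.1, 0.9]
y = X @ beta0 + 0.4 * rng.standard_normal(n)

# Oracle-style refit on the true support, standing in for a computable solution
beta_hat = np.zeros(p)
beta_hat[:s] = np.linalg.lstsq(X[:, :s], y, rcond=None)[0]

eta_n = np.max(np.abs(X.T @ (y - X @ beta_hat)) / n)   # eta_n = ||n^{-1} X^T (y - X beta_hat)||_inf
tau = np.sqrt(np.log(n)) * np.sqrt(np.log(p) / n)      # tau = c6 (log n)^{1/2} sqrt((log p)/n), c6 = 1
print(f"eta_n = {eta_n:.3f}, tau = {tau:.3f}")
```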

35 Implementation

36 Implementation Lasso-type methods: LARS algorithm (Efron et al., 2004); nonconcave penalized likelihood methods: LQA algorithm (Fan and Li, 2001) and LLA algorithm (Zou and Li, 2008). An alternative algorithm for solving large-scale problems: coordinate optimization (Friedman et al., 2007; Wu and Lange, 2008). ICA algorithm (Fan and Lv, 2011): implementing nonconcave penalized likelihood methods with second-order quadratic approximation of the likelihood function and coordinate optimization (see, e.g., Lin and Lv (2013) for convergence analysis).

37 ICA Algorithm For each coordinate within each iteration, we solve the univariate penalized least-squares problem with the corresponding quadratic approximation of the likelihood function, and update this coordinate only when the global minimizer has magnitude above the given threshold τ. The thresholded parameter space naturally puts a constraint on each component. Thresholding also induces additional sparsity of the regularized estimate, making the algorithm converge faster.
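
A minimal sketch of this idea for penalized least squares is given below: cyclic coordinate updates where each univariate minimizer (here with the L_1 penalty, i.e., soft thresholding) is retained only if its magnitude is at least τ and is set to zero otherwise. This is an illustration of the thresholded coordinate update, not the ICA implementation of Fan and Lv (2011).

```python
import numpy as np

def soft(z, lam):
    # soft-thresholding operator, the univariate L1-penalized least-squares solution
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def thresholded_cd(X, y, lam, tau, n_iter=100):
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                         # current residual y - X beta
    col_sq = (X ** 2).sum(axis=0) / n    # n^{-1} ||X_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]       # remove coordinate j from the fit
            z = X[:, j] @ r / n
            bj = soft(z, lam) / col_sq[j]
            beta[j] = bj if abs(bj) >= tau else 0.0   # keep only if magnitude >= tau
            r -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(4)
n, p = 100, 300
X = rng.standard_normal((n, p))
X = X / np.linalg.norm(X, axis=0) * np.sqrt(n)
beta0 = np.zeros(p)
beta0[:3] = [1.2, -0.9, 0.7]
y = X @ beta0 + 0.3 * rng.standard_normal(n)
beta_hat = thresholded_cd(X, y, lam=0.1, tau=0.2)
print(np.nonzero(beta_hat)[0], np.round(beta_hat[beta_hat != 0], 2))
```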

38 Algorithm Stability Assume p_λ(t) has maximum concavity ρ(p_λ) = sup_{0 < t_1 < t_2 < ∞} { -[p'_λ(t_2) - p'_λ(t_1)]/(t_2 - t_1) } < c c_2, with constants c and c_2. This ensures that Q_n(β) is strictly convex on a union of coordinate subspaces {β ∈ R^p : ‖β‖_0 < κ_c}, which is key to the stability of the sparse solution found by any algorithm. This condition holds for many concave penalties. For example, the L_1-penalty p_λ(t) = λt in Lasso has maximum concavity 0, SCAD has ρ(p_λ) = (a - 1)^{-1}, and SICA with p_λ(t; a) = λ(a + 1)t/(a + t) has maximum concavity 2λ(a^{-1} + a^{-2}).
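
The stated maximum concavity of SICA can be verified numerically from the definition above by taking the supremum of the difference quotients of p'_λ over a fine grid; a small sketch with illustrative values of λ and a:

```python
import numpy as np

lam, a = 1.0, 0.5
t = np.linspace(1e-6, 10, 200001)
p_prime = lam * (a + 1) * a / (a + t) ** 2           # derivative of the SICA penalty
rho_numeric = np.max(-np.diff(p_prime) / np.diff(t)) # sup of -[p'(t2) - p'(t1)]/(t2 - t1)
print(rho_numeric, 2 * lam * (1 / a + 1 / a ** 2))   # both close to 12 for these values
```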

39 Numerical Studies

40 Simulation setting GLMs with p ≫ n. 100 simulations with data generated from the linear model, logistic model, and Poisson model, respectively; the mean response vector depends on Xβ_0. The rows of X were sampled as i.i.d. copies from N(0, Σ) with Σ = (r^{|j-k|})_{1≤j,k≤p} for some number r. In the linear model, the error ε ~ N(0, σ² I_n) independent of X, with n = 100, σ = 0.4, and β_0 = (1, -0.5, 0.7, -1.2, -0.9, 0.5, 0.55, 0, ..., 0)^T. Considered (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25).
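
A minimal data-generating sketch for this linear-model setting (the sign pattern of β_0 follows the slide; everything else is as described above):

```python
import numpy as np

def simulate_linear(n=100, p=1000, r=0.25, sigma=0.4, seed=0):
    # Rows of X are i.i.d. N(0, Sigma) with Sigma = (r^{|j-k|}); Gaussian errors.
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = r ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta0 = np.zeros(p)
    beta0[:7] = [1, -0.5, 0.7, -1.2, -0.9, 0.5, 0.55]
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0

X, y, beta0 = simulate_linear()
print(X.shape, y.shape, np.nonzero(beta0)[0])
```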

41 Continued Compared the Lasso, SCAD, and SICA in the thresholded parameter space (Lasso_t, SCAD_t, SICA_t), as well as SCAD itself, with the oracle procedure. Choose τ = c_6 (log n)^{1/2} √((log p)/n) for some positive constant c_6. Since the main purpose of our simulation study is to justify the theoretical results, we use an independent validation set, with size equal to the sample size, to select tuning parameters.

42 Continued Performance measures: Prediction error (PE): E(Y - x^T β̂)² with β̂ an estimate and (x^T, Y) an independent observation, calculated based on an independent test sample of size 10,000. L_q-losses: ‖β̂ - β_0‖_q with q = 2, 1, and ∞. FP: number of falsely selected noise covariates. FN: number of missed true covariates. Model selection consistency probability based on 100 simulations. Estimate σ̂ of the error standard deviation σ.
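
These measures are straightforward to compute for any estimate; a minimal sketch for the linear model (the small random test data here only make the example self-contained):

```python
import numpy as np

def performance(beta_hat, beta0, X_test, y_test):
    # PE, L_q-losses, FP, and FN for an estimate beta_hat against the truth beta0
    diff = beta_hat - beta0
    return {
        "PE": float(np.mean((y_test - X_test @ beta_hat) ** 2)),
        "L2": float(np.linalg.norm(diff, 2)),
        "L1": float(np.linalg.norm(diff, 1)),
        "Linf": float(np.linalg.norm(diff, np.inf)),
        "FP": int(np.sum((beta_hat != 0) & (beta0 == 0))),
        "FN": int(np.sum((beta_hat == 0) & (beta0 != 0))),
    }

rng = np.random.default_rng(5)
X_test = rng.standard_normal((50, 10))
beta0 = np.array([1, -0.5, 0.7] + [0.0] * 7)
y_test = X_test @ beta0 + 0.4 * rng.standard_normal(50)
beta_hat = beta0.copy()
beta_hat[0], beta_hat[5] = 0.9, 0.1       # one mis-estimated and one spurious coefficient
print(performance(beta_hat, beta0, X_test, y_test))
```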

43 Simulation results: linear model [Figure 1: Boxplots of the PE, L_2-loss, FP, and FN over 100 simulations for all methods (Lasso_t, SCAD, SCAD_t, SICA_t, Oracle) in Section 4.2.1, with (p, r) = (5000, 0.25); the x-axis represents the different methods.]

44 Continued Table 1: The means and standard errors (in parentheses) of various performance measures, as well as the estimated error standard deviation, for all methods in Section 4.2.1; settings I, II, and III refer to the cases (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25), respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE (×0.1), L_2-loss (×0.1), L_1-loss (×0.1), L_∞-loss (×0.01), FP, FN, and σ̂ (×0.1).]

45 Continued Table 2: Model selection consistency probabilities of all methods in Section 4.2.1. [Rows: settings (p, r) = (1000, 0.25), (1000, 0.5), (5000, 0.25); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

46 Continued Table 3: The means and standard errors (in parentheses) of various performance measures, as well as the estimated error standard deviation, for all methods in Section 4.2.1 with (p, r) = (5000, 0.5); settings I, II, and III refer to the cases n = 100, 200, and 400, respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE (×0.1), L_2-loss (×0.1), L_1-loss (×0.1), L_∞-loss (×0.01), FP, FN, and σ̂ (×0.1).]

47 Continued Table 4: Model selection consistency probabilities of all methods in Section 4.2.1 with (p, r) = (5000, 0.5). [Rows indexed by n (100, 200, 400); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

48 Simulation results: logistic model In the logistic model, the response vector y was sampled from the Bernoulli distribution with success probability vector (e^{θ_1}/(1 + e^{θ_1}), ..., e^{θ_n}/(1 + e^{θ_n}))^T, with (θ_1, ..., θ_n)^T = Xβ_0. n = 200 and β_0 = (2, 0, -2.3, 0, 2.8, 0, -2.2, 0, 2.5, 0, ..., 0)^T. Prediction error: E{Y - exp(x^T β̂)/[1 + exp(x^T β̂)]}².
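
A corresponding data-generating and prediction-error sketch for the logistic setting (sign pattern of β_0 follows the slide; p and r are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, r = 200, 1000, 0.25
idx = np.arange(p)
Sigma = r ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta0 = np.zeros(p)
beta0[[0, 2, 4, 6, 8]] = [2, -2.3, 2.8, -2.2, 2.5]
prob = 1 / (1 + np.exp(-X @ beta0))
y = rng.binomial(1, prob)

def logistic_pe(beta_hat, X_test, y_test):
    # PE = mean of {Y - exp(x^T beta)/[1 + exp(x^T beta)]}^2 over the test sample
    p_hat = 1 / (1 + np.exp(-X_test @ beta_hat))
    return np.mean((y_test - p_hat) ** 2)

print(logistic_pe(beta0, X, y))   # PE of the true coefficient vector itself
```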

49 Continued [Figure 2: Boxplots of the PE, L_2-loss, FP, and FN over 100 simulations for all methods (Lasso_t, SCAD, SCAD_t, SICA_t, Oracle) in Section 4.2.2, with p = ; the x-axis represents the different methods.]

50 Continued Table 5: The means and standard errors (in parentheses) of various prediction and variable selection performance measures for all methods in Section 4.2.2; settings I, II, and III refer to the cases (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25), respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE (×0.01), L_2-loss, L_1-loss, L_∞-loss (×0.1), FP, and FN.]

51 Continued Table 6: Model selection consistency probabilities of all methods in Section 4.2.2. [Rows: settings (p, r) = (1000, 0.25), (1000, 0.5), (5000, 0.25); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

52 Simulation results: Poisson model In the Poisson model, the response vector y was sampled from the Poisson distribution with mean vector (e^{θ_1}, ..., e^{θ_n})^T, with (θ_1, ..., θ_n)^T = Xβ_0. n = 200 and β_0 = (1, -0.9, 0.8, -1.1, 0.6, 0, ..., 0)^T. Prediction error: E[Y - exp(x^T β̂)]².
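
And the analogous sketch for the Poisson setting (signs of β_0 follow the slide; p and r are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, r = 200, 1000, 0.25
idx = np.arange(p)
Sigma = r ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta0 = np.zeros(p)
beta0[:5] = [1, -0.9, 0.8, -1.1, 0.6]
y = rng.poisson(np.exp(X @ beta0))

def poisson_pe(beta_hat, X_test, y_test):
    # PE = mean of [Y - exp(x^T beta)]^2 over the test sample
    return np.mean((y_test - np.exp(X_test @ beta_hat)) ** 2)

print(poisson_pe(beta0, X, y))
```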

53 Continued [Figure 3: Boxplots of the PE, L_2-loss, FP, and FN over 100 simulations for all methods (Lasso_t, SCAD, SCAD_t, SICA_t, Oracle) in Section 4.2.3, with p = ; the x-axis represents the different methods.]

54 Continued Table 7: The 5% trimmed means and standard errors (in parentheses) of various prediction and variable selection performance measures for all methods in Section 4.2.3; settings I, II, and III refer to the cases (p, r) = (1000, 0.25), (1000, 0.5), and (5000, 0.25), respectively. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle; rows within each setting: PE, L_2-loss (×0.01), L_1-loss (×0.1), L_∞-loss (×0.01), FP, and FN.]

55 Continued Table 8: Model selection consistency probabilities of all methods in Section 4.2.3. [Rows: settings (p, r) = (1000, 0.25), (1000, 0.5), (5000, 0.25); columns: Lasso_t, SCAD, SCAD_t, SICA_t, Oracle.]

56 Application to genomic data The prostate cancer data set in Singh et al. (2002) consists of 136 patient samples, 77 from the prostate tumor group and 59 from the normal group, with each patient having gene expression measurements for 12,600 genes. Randomly split all samples into a training set of 52 samples from the cancer class and 50 samples from the normal class, and a test set of 25 samples from the cancer class and 9 samples from the normal class. For each splitting, we fit a logistic regression model to the training data with the regularization methods, and then calculated the classification error using the test data. We repeated the random splitting 50 times, and tuning parameters were selected by fivefold CV.
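
The random-splitting evaluation can be sketched with any off-the-shelf penalized logistic regression; below, an L_1-penalized fit with fivefold cross-validation from scikit-learn stands in for the talk's methods, and the data are faked so the snippet runs end to end (the real study uses the 12,600-gene expression matrix and class-stratified 102/34 splits).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(8)
X = rng.standard_normal((136, 500))          # placeholder for the gene expression matrix
y = (rng.random(136) < 0.57).astype(int)     # placeholder 0/1 tumor labels

errors = []
for split in range(10):                      # 50 splits in the talk; 10 here to keep it quick
    idx = rng.permutation(len(y))
    train, test = idx[:102], idx[102:]       # 102 training and 34 test samples (plain random split)
    clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear")
    clf.fit(X[train], y[train])
    errors.append(1 - accuracy_score(y[test], clf.predict(X[test])))
print(f"mean classification error: {np.mean(errors):.3f} "
      f"(SE {np.std(errors) / np.sqrt(len(errors)):.3f})")
```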

57 Continued Table 9: The means and standard errors of classification errors by different methods over 50 random splittings of the prostate cancer data in Section 4.3. [Columns: Lasso_t, SCAD, SCAD_t, SICA_t; rows: mean and standard error.]

58 Continued Table 10: Selection probabilities of the most frequently selected genes, with number up to the median model size by each method, across 50 random splittings of the prostate cancer data in Section 4.3. [Two blocks of columns: Gene ID, Lasso_t, SCAD, SCAD_t, SICA_t.]

59 Conclusions We have studied the asymptotic equivalence of two popular classes of regularization methods, convex ones and concave ones, in high-dimensional GLMs. Oracle inequalities, as well as stronger oracle risk inequalities, of the global minimizer and sampling properties of computable solutions have been established for regularization methods to characterize their connections and differences. When p = O(n^a), all regularization methods under consideration, including the Lasso, are asymptotically equivalent, while when log p = O(n^a), concave methods are asymptotically equivalent to each other but have faster convergence rates than the Lasso.


More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

A Constructive Approach to L 0 Penalized Regression

A Constructive Approach to L 0 Penalized Regression Journal of Machine Learning Research 9 (208) -37 Submitted 4/7; Revised 6/8; Published 8/8 A Constructive Approach to L 0 Penalized Regression Jian Huang Department of Applied Mathematics The Hong Kong

More information

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators

Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators Electronic Journal of Statistics ISSN: 935-7524 arxiv: arxiv:503.0388 Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators Yuchen Zhang, Martin J. Wainwright

More information

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS

The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS A Dissertation in Statistics by Ye Yu c 2015 Ye Yu Submitted

More information

Ultra High Dimensional Variable Selection with Endogenous Variables

Ultra High Dimensional Variable Selection with Endogenous Variables 1 / 39 Ultra High Dimensional Variable Selection with Endogenous Variables Yuan Liao Princeton University Joint work with Jianqing Fan Job Market Talk January, 2012 2 / 39 Outline 1 Examples of Ultra High

More information