Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)
1 Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001). Presented by Yang Zhao, March 5.
2 Outline
3 Motivation
- Variable selection is fundamental to high-dimensional statistical modeling.
- Many approaches in common use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable-selection process.
- The theoretical properties of stepwise deletion and subset selection are somewhat hard to understand.
- The most severe drawback of best-subset variable selection is its lack of stability, as analyzed by Breiman (1996).
4 Overview of Proposed Method
- Penalized likelihood approaches are proposed, which simultaneously select variables and estimate coefficients.
- The penalty functions are symmetric, nonconcave on $(0, \infty)$, and have singularities at the origin to produce sparse solutions.
- Furthermore, the penalty functions are bounded by a constant to reduce bias, and satisfy certain conditions to yield continuous solutions.
- A new, widely applicable algorithm is proposed for optimizing penalized likelihood functions.
- Rates of convergence of the proposed penalized likelihood estimators are established.
- With proper choice of regularization parameters, the proposed estimators perform as well as the oracle procedure (as if the correct submodel were known).
5 Penalized Least Squares and Subset Selection
Linear regression model: $y = X\beta + \varepsilon$, where $y$ is $n \times 1$ and $X$ is $n \times d$. Assume for now that the columns of $X$ are orthonormal, and denote $z = X^T y$. A form of the penalized least squares is
$$\frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_{j=1}^d p_j(|\beta_j|) = \frac{1}{2}\|y - XX^T y\|^2 + \frac{1}{2}\sum_{j=1}^d (z_j - \beta_j)^2 + \lambda \sum_{j=1}^d p_j(|\beta_j|).$$
The $p_j(\cdot)$ are not necessarily the same for all $j$, but we assume they are for simplicity. From now on, denote $\lambda p(\cdot)$ by $p_\lambda(\cdot)$. The minimization problem separates componentwise, which leads us to consider the one-dimensional penalized least squares problem
$$\frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|). \quad (2.3)$$
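The componentwise reduction above can be checked numerically. The sketch below (my own illustration, not from the slides; the design, the $L_1$ penalty, and $\lambda = 0.3$ are arbitrary choices) verifies that for orthonormal columns the full objective differs from the separable one only by a constant not involving $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 3, 0.3
# A matrix with orthonormal columns via QR, so that X.T @ X = I_d.
X, _ = np.linalg.qr(rng.normal(size=(n, d)))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)
z = X.T @ y

def pls(b):
    # 0.5 * ||y - X b||^2 + lam * sum |b_j|  (L1 penalty chosen as an example)
    return 0.5 * np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b))

def separable(b):
    # Same objective up to a constant not involving b:
    # sum_j [ 0.5*(z_j - b_j)^2 + lam*|b_j| ] + 0.5*(y'y - z'z)
    const = 0.5 * (y @ y - z @ z)
    return const + np.sum(0.5 * (z - b) ** 2 + lam * np.abs(b))

b = rng.normal(size=d)
diff = pls(b) - separable(b)
```

Because `X.T @ X` is the identity, the cross terms cancel and `diff` is zero up to floating-point error, so minimizing over each component of $\beta$ separately solves the full problem.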
6 Penalized Least Squares and Subset Selection (cont'd)
Taking the hard thresholding penalty function
$$p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda),$$
the hard thresholding rule is obtained as $\hat\theta = z\, I(|z| > \lambda)$. This coincides with best subset selection and with stepwise deletion and addition for orthonormal designs!
7 What Makes a Good Penalty Function?
- Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
- Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
- Continuity: the resulting estimator is continuous in the data $z$, to avoid instability in model prediction.
8 Sufficient Conditions for Good Penalty Functions
- $p'_\lambda(|\theta|) = 0$ for large $|\theta|$ $\Rightarrow$ unbiasedness. Intuition: the first-order derivative of (2.3) with respect to $\theta$ is $\mathrm{sgn}(\theta)\{|\theta| + p'_\lambda(|\theta|)\} - z$. When $p'_\lambda(|\theta|) = 0$ for large $|\theta|$, the resulting estimator is $z$ when $|z|$ is sufficiently large, which is approximately unbiased.
- The minimum of the function $|\theta| + p'_\lambda(|\theta|)$ is positive $\Rightarrow$ sparsity.
- The minimum of the function $|\theta| + p'_\lambda(|\theta|)$ is attained at 0 $\Rightarrow$ continuity. Intuition: see the next page.
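These three conditions can be illustrated numerically. The sketch below (my own check, with illustrative values $\lambda = 1$, $a = 3.7$) evaluates $\theta + p'_\lambda(\theta)$ on a grid for the hard, soft ($L_1$), and SCAD penalties and records where its minimum sits: a positive minimum gives sparsity, and a minimum attained at 0 gives continuity.

```python
import numpy as np

lam, a = 1.0, 3.7
theta = np.linspace(1e-6, 6.0, 600_000)

# Penalty derivatives p'_lam(theta) for theta > 0.
dp = {
    "hard": np.where(theta < lam, 2.0 * (lam - theta), 0.0),
    "soft": np.full_like(theta, lam),
    "scad": lam * ((theta <= lam)
                   + np.clip(a * lam - theta, 0, None) / ((a - 1) * lam)
                   * (theta > lam)),
}

# For each penalty: (min of theta + p'(theta), location of that minimum).
result = {k: (float(np.min(theta + v)), float(theta[np.argmin(theta + v)]))
          for k, v in dp.items()}
```

All three have a positive minimum (sparsity), but for the hard penalty the minimum sits at $\theta = \lambda$ rather than at 0, which is why the hard thresholding rule is discontinuous while the soft and SCAD rules are continuous.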
9 Sufficient Conditions for Good Penalty Functions (cont'd)
- When $|z| < \min_{\theta \neq 0}\{|\theta| + p'_\lambda(|\theta|)\}$, the derivative of (2.3) is positive for all positive $\theta$ and negative for all negative $\theta$, so $\hat\theta = 0$.
- When $|z| > \min_{\theta \neq 0}\{|\theta| + p'_\lambda(|\theta|)\}$, two crossings may exist as shown; the larger one is a penalized least squares estimator.
- Further, this implies that a sufficient and necessary condition for continuity is that the minimum be attained at 0.
10 Penalty Functions of Interest
- Hard thresholding penalty: $p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda)$.
- Soft thresholding penalty ($L_1$ penalty, LASSO): $p_\lambda(|\theta|) = \lambda|\theta|$.
- $L_2$ penalty (ridge regression): $p_\lambda(|\theta|) = \lambda|\theta|^2$.
- SCAD penalty (Smoothly Clipped Absolute Deviation), defined through its derivative:
$$p'_\lambda(\theta) = \lambda\Big\{I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda} I(\theta > \lambda)\Big\} \quad \text{for some } a > 2 \text{ and } \theta > 0.$$
Note: of the three properties that make a penalty function good, only SCAD possesses all (and is therefore advocated by the authors); none of the other penalty functions satisfies the three sufficient conditions simultaneously.
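Since SCAD is specified through its derivative, the penalty itself can be recovered by integration. A small numeric check (my own sketch, with illustrative values $\lambda = 1$, $a = 3.7$) integrates $p'_\lambda$ with the trapezoid rule and confirms the property stated earlier: the penalty is bounded by the constant $(a+1)\lambda^2/2$ beyond $a\lambda$, which is what reduces bias for large coefficients.

```python
import numpy as np

lam, a = 1.0, 3.7
t = np.linspace(0.0, 8.0, 800_001)   # grid chosen to hit lam and a*lam exactly
dp = lam * ((t <= lam)
            + np.clip(a * lam - t, 0, None) / ((a - 1) * lam) * (t > lam))

# SCAD penalty as the integral of its derivative (trapezoid rule, p(0) = 0).
p = np.concatenate(([0.0],
                    np.cumsum(0.5 * (dp[1:] + dp[:-1]) * np.diff(t))))

flat = p[t >= a * lam]      # beyond a*lam the penalty is constant
bound = (a + 1) * lam**2 / 2
```

The derivative is piecewise linear, so the trapezoid rule is exact here; `p` rises like $\lambda\theta$ near the origin (the singularity that produces sparsity) and flattens at `bound` past $a\lambda$.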
11 Penalty Functions of Interest (cont'd) [figure]
12 Penalty Functions of Interest (cont'd)
When the design matrix $X$ is orthonormal, closed-form solutions are obtained as follows.
- Hard thresholding rule: $\hat\theta = z\, I(|z| > \lambda)$.
- LASSO ($L_1$ penalty): $\hat\theta = \mathrm{sgn}(z)(|z| - \lambda)_+$.
- SCAD:
$$\hat\theta = \begin{cases} \mathrm{sgn}(z)(|z| - \lambda)_+ & \text{when } |z| \le 2\lambda, \\ \{(a-1)z - \mathrm{sgn}(z)a\lambda\}/(a-2) & \text{when } 2\lambda < |z| \le a\lambda, \\ z & \text{when } |z| > a\lambda. \end{cases}$$
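The three closed-form rules above translate directly into code. A minimal sketch (names and vectorized style are my own):

```python
import numpy as np

def hard_threshold(z, lam):
    # Keep z unchanged when |z| > lam, otherwise set it to zero.
    return z * (np.abs(z) > lam)

def soft_threshold(z, lam):
    # Shrink toward zero by lam; small values are set exactly to zero.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    z = np.asarray(z, dtype=float)
    az = np.abs(z)
    soft = np.sign(z) * np.maximum(az - lam, 0.0)            # |z| <= 2*lam
    mid = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)     # 2*lam < |z| <= a*lam
    return np.where(az <= 2 * lam, soft,
                    np.where(az <= a * lam, mid, z))         # |z| > a*lam: no shrinkage
```

The three SCAD branches agree at the knots $|z| = 2\lambda$ and $|z| = a\lambda$, so the rule is continuous, and it leaves large $z$ untouched, which the hard and soft rules cannot do simultaneously.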
13 Penalty Functions of Interest (cont'd) [figure]
14 More on SCAD
- The SCAD penalty has two unknown parameters, $\lambda$ and $a$. They could be chosen by grid search using cross-validation (CV) or generalized cross-validation (GCV), but that is computationally intensive.
- The authors applied a Bayesian risk analysis to help choose $a$, since the Bayes risks do not appear very sensitive to the value of $a$. They found that $a = 3.7$ performs similarly to the value chosen by GCV.
15 Performance Comparison
Refer to the last graph on the previous slide. SCAD performs favorably compared with the other two thresholding rules. It is in fact expected to perform best, given that it retains all the good mathematical properties of the other two penalty functions.
16 Penalized Likelihood Formulation
From now on, assume that $X$ is standardized so that each column has mean 0 and variance 1, and is no longer orthonormal.
- A form of penalized least squares is
$$\frac{1}{2}(y - X\beta)^T(y - X\beta) + n\sum_{j=1}^d p_\lambda(|\beta_j|). \quad (3.1)$$
- A form of penalized robust least squares is
$$\sum_{i=1}^n \psi(|y_i - x_i^T\beta|) + n\sum_{j=1}^d p_\lambda(|\beta_j|). \quad (3.2)$$
- A form of penalized likelihood for generalized linear models is
$$\sum_{i=1}^n l_i(g(x_i^T\beta), y_i) + n\sum_{j=1}^d p_\lambda(|\beta_j|). \quad (3.3)$$
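Objective (3.1) with the SCAD penalty can be written out concretely. The closed form of the SCAD penalty used below follows from integrating its derivative (linear up to $\lambda$, quadratic up to $a\lambda$, then constant); the function names and wiring are my own sketch.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """Closed-form SCAD penalty, obtained by integrating its derivative."""
    t = np.abs(theta)
    return np.where(
        t <= lam,
        lam * t,                                                    # linear near 0
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),  # quadratic
                 (a + 1) * lam**2 / 2))                             # constant tail

def pls_objective(beta, X, y, lam):
    # Penalized least squares (3.1) with the SCAD penalty.
    n = len(y)
    resid = y - X @ beta
    return 0.5 * resid @ resid + n * np.sum(scad_penalty(beta, lam))
```

The three pieces of `scad_penalty` match at the knots $\lambda$ and $a\lambda$, so the penalty is continuous, and its flat tail is exactly the bound $(a+1)\lambda^2/2$ mentioned earlier.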
17 Sampling Properties and Oracle Properties
Let $\beta_0 = (\beta_{10}, \ldots, \beta_{d0})^T = (\beta_{10}^T, \beta_{20}^T)^T$, and assume that $\beta_{20} = 0$. Let $I(\beta_0)$ be the Fisher information matrix and $I_1(\beta_{10}, 0)$ the Fisher information knowing $\beta_{20} = 0$.
General setting: let $V_i = (X_i, Y_i)$, $i = 1, \ldots, n$. Let $L(\beta)$ be the log-likelihood function of the observations $V_1, \ldots, V_n$, and let $Q(\beta)$ be the penalized likelihood function $L(\beta) - n\sum_{j=1}^d p_\lambda(|\beta_j|)$.
18 Sampling Properties and Oracle Properties (cont'd)
Regularity Conditions:
(A) The observations $V_i$ are iid with density $f(V, \beta)$. $f(V, \beta)$ has a common support and the model is identifiable. Furthermore,
$$E_\beta\Big(\frac{\partial \log f(V,\beta)}{\partial \beta_j}\Big) = 0 \quad \text{for } j = 1, \ldots, d,$$
and
$$I_{jk}(\beta) = E_\beta\Big(\frac{\partial \log f(V,\beta)}{\partial \beta_j}\,\frac{\partial \log f(V,\beta)}{\partial \beta_k}\Big) = E_\beta\Big(-\frac{\partial^2 \log f(V,\beta)}{\partial \beta_j\,\partial \beta_k}\Big).$$
(B) The Fisher information matrix $I(\beta) = E\big\{\big(\frac{\partial}{\partial \beta}\log f(V,\beta)\big)\big(\frac{\partial}{\partial \beta}\log f(V,\beta)\big)^T\big\}$ is finite and positive definite at $\beta = \beta_0$.
(C) There exists an open subset $\omega$ of $\Omega$ containing the true parameter point $\beta_0$ such that for almost all $V$ the density $f(V, \beta)$ admits all third derivatives for all $\beta \in \omega$. Furthermore, there exist functions $M_{jkl}$ such that $\big|\frac{\partial^3}{\partial \beta_j \partial \beta_k \partial \beta_l}\log f(V,\beta)\big| \le M_{jkl}(V)$ for all $\beta \in \omega$, where $m_{jkl} = E_\beta(M_{jkl}(V)) < \infty$ for all $j, k, l$.
19 Sampling Properties and Oracle Properties (cont'd)
Theorem 1. Let $V_1, \ldots, V_n$ be iid, each with a density $f(V, \beta)$ satisfying conditions (A)-(C). If $\max\{|p''_{\lambda_n}(|\beta_{j0}|)| : \beta_{j0} \neq 0\} \to 0$, then there exists a local maximizer $\hat\beta$ of $Q(\beta)$ such that $\|\hat\beta - \beta_0\| = O_p(n^{-1/2} + a_n)$, where $a_n = \max\{|p'_{\lambda_n}(|\beta_{j0}|)| : \beta_{j0} \neq 0\}$.
Note: for the hard thresholding and SCAD penalty functions, $\lambda_n \to 0$ implies $a_n = 0$ for $n$ large enough; therefore the penalized likelihood estimator is root-$n$ consistent.
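The note above hinges on the SCAD derivative vanishing at the nonzero true coefficients once $\lambda_n$ is small. A tiny numeric illustration (my own; the coefficient values and $\lambda_n = 0.1$ are made up): as soon as $a\lambda_n < \min_j |\beta_{j0}|$, every $p'_{\lambda_n}(|\beta_{j0}|)$ is exactly 0, so $a_n = 0$, whereas for the LASSO $a_n = \lambda_n$ always.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    t = np.abs(t)
    return lam * ((t <= lam)
                  + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

beta_nonzero = np.array([1.0, -2.0, 0.8])   # the nonzero true coefficients
lam_n = 0.1                                  # a * lam_n = 0.37 < min |beta_j0|

a_n_scad = float(np.max(scad_deriv(beta_nonzero, lam_n)))   # exactly 0
a_n_lasso = lam_n                            # LASSO: p'(t) = lam for all t > 0
```

This is precisely why SCAD escapes the extra $a_n$ term in the $O_p(n^{-1/2} + a_n)$ rate while the LASSO does not.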
20 Sampling Properties and Oracle Properties (cont'd)
Lemma 1. Let $V_1, \ldots, V_n$ be iid, each with a density $f(V, \beta)$ satisfying conditions (A)-(C). Assume that
$$\liminf_{n \to \infty}\; \liminf_{\theta \to 0^+}\; p'_{\lambda_n}(\theta)/\lambda_n > 0. \quad (3.5)$$
If $\lambda_n \to 0$ and $\sqrt{n}\,\lambda_n \to \infty$, then with probability tending to 1, for any given $\beta_1$ satisfying $\|\beta_1 - \beta_{10}\| = O_p(n^{-1/2})$ and any constant $C$,
$$Q\{(\beta_1^T, 0)^T\} = \max_{\|\beta_2\| \le C n^{-1/2}} Q\{(\beta_1^T, \beta_2^T)^T\}.$$
Aside: denote $\Sigma = \mathrm{diag}\{p''_{\lambda_n}(\beta_{10}), \ldots, p''_{\lambda_n}(\beta_{s0})\}$ and $b = (p'_{\lambda_n}(\beta_{10})\,\mathrm{sgn}(\beta_{10}), \ldots, p'_{\lambda_n}(\beta_{s0})\,\mathrm{sgn}(\beta_{s0}))^T$, where $s$ is the number of components of $\beta_{10}$.
Theorem 2 (Oracle Property). Let $V_1, \ldots, V_n$ be iid, each with a density $f(V, \beta)$ satisfying conditions (A)-(C). Assume that the penalty function $p_{\lambda_n}(|\theta|)$ satisfies condition (3.5). If $\lambda_n \to 0$ and $\sqrt{n}\,\lambda_n \to \infty$, then with probability tending to 1, the root-$n$ consistent local maximizer $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ in Theorem 1 must satisfy:
(a) Sparsity: $\hat\beta_2 = 0$.
(b) Asymptotic normality: $\sqrt{n}\,(I_1(\beta_{10}) + \Sigma)\{\hat\beta_1 - \beta_{10} + (I_1(\beta_{10}) + \Sigma)^{-1} b\} \to N\{0, I_1(\beta_{10})\}$.
23 Sampling Properties and Oracle Properties (cont'd)
Remarks:
(1) For the hard thresholding and SCAD penalty functions, if $\lambda_n \to 0$ then $a_n = 0$. Hence, by Theorem 2, when $\sqrt{n}\,\lambda_n \to \infty$, their corresponding penalized likelihood estimators possess the oracle property and perform as well as the maximum likelihood estimates for estimating $\beta_1$ knowing $\beta_2 = 0$.
(2) For the LASSO ($L_1$) penalty, however, $a_n = \lambda_n$. Root-$n$ consistency then requires $\lambda_n = O_p(n^{-1/2})$, while the oracle property requires $\sqrt{n}\,\lambda_n \to \infty$. These two conditions cannot be satisfied simultaneously. The authors conjecture that the oracle property does not hold for the LASSO.
(3) For the $L_q$ penalty with $q < 1$, the oracle property continues to hold with a suitable choice of $\lambda_n$.
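Remark (2) reflects the LASSO's constant bias $\lambda$ on large coefficients, which SCAD avoids. A quick Monte Carlo illustration (my own sketch; $\theta = 5$, $\lambda = 1$, $a = 3.7$ are illustrative) applies both thresholding rules to $z \sim N(\theta, 1)$ and compares biases:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, lam, a = 5.0, 1.0, 3.7          # large true coefficient, illustrative values
z = theta + rng.normal(size=200_000)   # z ~ N(theta, 1)

# LASSO (soft) and SCAD thresholding estimates of theta from z.
soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
az = np.abs(z)
scad = np.where(az <= 2 * lam, np.sign(z) * np.maximum(az - lam, 0.0),
                np.where(az <= a * lam,
                         ((a - 1) * z - np.sign(z) * a * lam) / (a - 2), z))

bias_soft = float(np.mean(soft) - theta)   # close to -lam
bias_scad = float(np.mean(scad) - theta)   # close to 0
```

To shrink its bias the LASSO must shrink $\lambda_n$, which is exactly what collides with the $\sqrt{n}\,\lambda_n \to \infty$ requirement of the oracle property; SCAD faces no such trade-off.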
24
- Tibshirani (1996) proposed an algorithm for solving the constrained least squares problem of the LASSO.
- Fu (1998) provided a shooting algorithm for the LASSO.
- LASSO2 was submitted by Berwin Turlach at Statlib.
- Here the authors propose a unified algorithm that optimizes problems (3.1), (3.2), and (3.3) via local quadratic approximations.
25 The Algorithm
Rewrite (3.1), (3.2), (3.3) as
$$l(\beta) + n\sum_{j=1}^d p_\lambda(|\beta_j|), \quad (3.6)$$
where $l(\beta)$ is a general loss function.
- Given an initial value $\beta^0$ close to the minimizer of (3.6): if $\beta_{j0}$ is very close to 0, set $\hat\beta_j = 0$; otherwise the penalty can be locally approximated by a quadratic function, using
$$[p_\lambda(|\beta_j|)]' = p'_\lambda(|\beta_j|)\,\mathrm{sgn}(\beta_j) \approx \{p'_\lambda(|\beta_{j0}|)/|\beta_{j0}|\}\,\beta_j.$$
In other words,
$$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_{j0}|) + \tfrac{1}{2}\{p'_\lambda(|\beta_{j0}|)/|\beta_{j0}|\}(\beta_j^2 - \beta_{j0}^2), \quad \text{for } \beta_j \approx \beta_{j0}.$$
- $\psi(|y - x^T\beta|)$ in the robust penalized least squares case can be approximated by $\{\psi(|y - x^T\beta^0|)/(y - x^T\beta^0)^2\}(y - x^T\beta)^2$.
- Assume the log-likelihood function is smooth, so it can be locally approximated by a quadratic function and the Newton-Raphson algorithm applies. The updating equation is
$$\beta^1 = \beta^0 - \{\nabla^2 l(\beta^0) + n\Sigma_\lambda(\beta^0)\}^{-1}\{\nabla l(\beta^0) + n U_\lambda(\beta^0)\},$$
where $\Sigma_\lambda(\beta^0) = \mathrm{diag}\{p'_\lambda(|\beta_{10}|)/|\beta_{10}|, \ldots, p'_\lambda(|\beta_{d0}|)/|\beta_{d0}|\}$ and $U_\lambda(\beta^0) = \Sigma_\lambda(\beta^0)\beta^0$.
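For the least squares case $l(\beta) = \frac{1}{2}\|y - X\beta\|^2$, the Newton step above simplifies to a ridge-like solve on the active coefficients. The sketch below is my own minimal implementation of this local quadratic approximation (LQA) idea for SCAD-penalized least squares; the stopping rules, tolerances, and simulated data are illustrative choices, not the authors' exact recipe.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    t = np.abs(t)
    return lam * ((t <= lam)
                  + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def lqa_scad(X, y, lam, a=3.7, n_iter=100, eps=1e-6):
    """LQA iterations for SCAD-penalized least squares, l(b) = 0.5*||y - Xb||^2.

    With this l, the update beta0 - (X'X + n*Sigma)^{-1}(X'X beta0 - X'y
    + n*Sigma beta0) collapses to (X'X + n*Sigma)^{-1} X'y on the active set."""
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # initial value: OLS
    active = np.ones(d, dtype=bool)
    for _ in range(n_iter):
        active &= np.abs(beta) > eps              # delete coefficients near zero
        beta[~active] = 0.0
        Xa = X[:, active]
        Sigma = np.diag(scad_deriv(beta[active], lam, a) / np.abs(beta[active]))
        beta_new = np.linalg.solve(Xa.T @ Xa + n * Sigma, Xa.T @ y)
        done = np.max(np.abs(beta_new - beta[active])) < 1e-10
        beta[active] = beta_new
        if done:
            break
    return beta

rng = np.random.default_rng(2)
n, d = 200, 6
X = rng.normal(size=(n, d))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(size=n)
est = lqa_scad(X, y, lam=0.25)
```

Small coefficients get a large diagonal entry $p'_\lambda(|\beta_{j0}|)/|\beta_{j0}|$ and are shrunk geometrically until they are deleted, while coefficients beyond $a\lambda$ have $p'_\lambda = 0$ and are left unpenalized, matching the unbiasedness property.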
26 Standard Error Formula
Obtained directly from the algorithm, via the sandwich formula
$$\widehat{\mathrm{cov}}(\hat\beta_1) = \{\nabla^2 l(\hat\beta_1) + n\Sigma_\lambda(\hat\beta_1)\}^{-1}\, \widehat{\mathrm{cov}}\{\nabla l(\hat\beta_1)\}\, \{\nabla^2 l(\hat\beta_1) + n\Sigma_\lambda(\hat\beta_1)\}^{-1}$$
for the nonvanishing components of $\beta$.
27 Prediction and Model Error
- Two regression situations: $X$ random and $X$ controlled. For ease of presentation, consider only the $X$-random case.
- $\mathrm{PE}(\hat\mu) = E\{Y - \hat\mu(x)\}^2 = E\{Y - E(Y \mid x)\}^2 + E\{E(Y \mid x) - \hat\mu(x)\}^2.$
- The second component is the model error.
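The decomposition of prediction error into irreducible noise plus model error is easy to verify by simulation. In this sketch (my own; the linear truth $E(Y \mid x) = 2x$ and the deliberately miscalibrated fit $\hat\mu(x) = 1.8x$ are made-up illustrations), the two sides of the identity agree up to Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
x = rng.normal(size=500_000)
y = 2.0 * x + sigma * rng.normal(size=x.size)   # true regression E(Y|x) = 2x

mu_hat = lambda v: 1.8 * v    # a deliberately miscalibrated fitted model

pe = float(np.mean((y - mu_hat(x)) ** 2))         # prediction error
me = float(np.mean((2.0 * x - mu_hat(x)) ** 2))   # model error E{E(Y|x) - mu_hat(x)}^2
# pe should be close to sigma**2 + me: noise variance plus model error.
```

Here the model error is $E(0.2x)^2 = 0.04$, so the prediction error is about $1.04$; only the model-error term is reducible by better estimation, which is why the paper compares methods on model error.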
28 Simulation Results [table]
29 Simulation Results (cont'd)
Notes on the table columns:
- SD: median absolute deviation of $\hat\beta_1$.
- SD$_m$: median of $\hat\sigma(\hat\beta_1)$.
- SD$_{mad}$: median absolute deviation of $\hat\sigma(\hat\beta_1)$.
30 Simulation Results (cont'd) [table]
31 Real Data Analysis [table]
32 Conclusions
- A family of penalty functions was introduced, of which SCAD performs favorably.
- Rates of convergence of the proposed penalized likelihood estimators were established.
- With proper choice of regularization parameters, the estimators perform as well as the oracle procedure.
- The unified algorithm was demonstrated to be effective, and standard errors were estimated with good accuracy.
- The proposed approach can be applied to various statistical contexts.
33-35 Proofs
36 The End!
Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering
More informationRegression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood
Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. β 0, β 1 Q = n (Y i (β 0 + β 1 X i )) 2 i=1 Minimize this by maximizing
More informationComparisons of penalized least squares. methods by simulations
Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy
More informationIterative Selection Using Orthogonal Regression Techniques
Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department
More informationThe deterministic Lasso
The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality
More informationFeature selection with high-dimensional data: criteria and Proc. Procedures
Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June
More informationLecture 2 Part 1 Optimization
Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss
More informationMS-C1620 Statistical inference
MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents
More informationHigh-dimensional Covariance Estimation Based On Gaussian Graphical Models
High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix
More informationHigh-dimensional regression with unknown variance
High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationNonparametric Regression. Badr Missaoui
Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y
More informationGraduate Econometrics I: Unbiased Estimation
Graduate Econometrics I: Unbiased Estimation Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Unbiased Estimation
More informationThe annals of Statistics (2006)
High dimensional graphs and variable selection with the Lasso Nicolai Meinshausen and Peter Buhlmann The annals of Statistics (2006) presented by Jee Young Moon Feb. 19. 2010 High dimensional graphs and
More informationDimension Reduction Methods
Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.
More informationConsistent Group Identification and Variable Selection in Regression with Correlated Predictors
Consistent Group Identification and Variable Selection in Regression with Correlated Predictors Dhruv B. Sharma, Howard D. Bondell and Hao Helen Zhang Abstract Statistical procedures for variable selection
More informationHigh-dimensional covariance estimation based on Gaussian graphical models
High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,
More informationWEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract
Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of
More informationSample Size Requirement For Some Low-Dimensional Estimation Problems
Sample Size Requirement For Some Low-Dimensional Estimation Problems Cun-Hui Zhang, Rutgers University September 10, 2013 SAMSI Thanks for the invitation! Acknowledgements/References Sun, T. and Zhang,
More informationLong-Run Covariability
Long-Run Covariability Ulrich K. Müller and Mark W. Watson Princeton University October 2016 Motivation Study the long-run covariability/relationship between economic variables great ratios, long-run Phillips
More informationRobust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly
Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationMSA220/MVE440 Statistical Learning for Big Data
MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification
More informationOn Mixture Regression Shrinkage and Selection via the MR-LASSO
On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University
More informationParameter Norm Penalties. Sargur N. Srihari
Parameter Norm Penalties Sargur N. srihari@cedar.buffalo.edu 1 Regularization Strategies 1. Parameter Norm Penalties 2. Norm Penalties as Constrained Optimization 3. Regularization and Underconstrained
More informationMODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS
Statistica Sinica 23 (2013), 901-927 doi:http://dx.doi.org/10.5705/ss.2011.058 MODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS Hyunkeun Cho and Annie Qu University of Illinois at
More informationRobust Variable Selection Through MAVE
Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse
More informationEconometrics I, Estimation
Econometrics I, Estimation Department of Economics Stanford University September, 2008 Part I Parameter, Estimator, Estimate A parametric is a feature of the population. An estimator is a function of the
More informationMachine Learning for Economists: Part 4 Shrinkage and Sparsity
Machine Learning for Economists: Part 4 Shrinkage and Sparsity Michal Andrle International Monetary Fund Washington, D.C., October, 2018 Disclaimer #1: The views expressed herein are those of the authors
More informationChapter 3. Linear Models for Regression
Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear
More information