Nonconcave Penalized Likelihood with A Diverging Number of Parameters


Nonconcave Penalized Likelihood with A Diverging Number of Parameters
Jianqing Fan and Heng Peng
Presenter: Jiale Xu
March 12, 2010

Outline
Introduction
Penalty function
Properties of penalized likelihood estimation
Proof of theorems
Numerical examples

Background
Traditional variable selection procedures (AIC, BIC, etc.) use a fixed penalty on the size of a model. Some newer variable selection procedures use a data-adaptive penalty in place of the fixed penalties. All of the above follow stepwise or subset selection procedures, which are computationally intensive, hard to derive sampling properties for, and unstable. Most convex penalties produce shrinkage estimators that trade off bias and variance, such as those in smoothing splines; however, they can create unnecessary bias when the true parameters are large, and they cannot produce parsimonious models.

Background
Fan and Li (2001) proposed a unified approach via nonconcave penalized least squares to select variables and estimate their coefficients automatically and simultaneously. The method retains the good features of both subset selection and ridge regression: it produces sparse solutions, ensures continuity of the selected models, and gives unbiased estimates of large coefficients. This is achieved by choosing suitable nonconcave penalty functions, such as the smoothly clipped absolute deviation (SCAD) penalty of Fan (1997). Other penalized least squares methods, such as the LASSO, can also be studied in this unified framework. The nonconcave penalized least-squares approach also corresponds to Bayesian model selection with an improper prior, and it extends to likelihood-based models in various contexts.

Nonconcave penalized likelihood
Let $\log f(V, \beta)$ be the log-likelihood of a random vector $V$. This includes likelihoods of the form $L(X^T\beta, Y)$ for the generalized linear model. Let $p_\lambda(|\beta_j|)$ be a nonconcave penalty function indexed by a regularization parameter $\lambda$. The penalized likelihood estimator then maximizes
$$\sum_{i=1}^{n} \log f(V_i, \beta) - \sum_{j=1}^{p} p_\lambda(|\beta_j|). \tag{1.1}$$
The parameter $\lambda$ can be chosen by cross-validation. Various algorithms have been proposed to optimize such a high-dimensional nonconcave likelihood function, such as the modified Newton–Raphson algorithm of Fan and Li (2001).
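Fan and Li (2001)'s modified Newton–Raphson iteration rests on a local quadratic approximation (LQA) of the penalty near the current iterate, which turns each step into a ridge-type update. Below is a minimal sketch for the Gaussian (penalized least-squares) case; the function names, the SCAD constant a = 3.7, the tolerances, and the zero-threshold are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """p'_lambda(theta) for the SCAD penalty, theta >= 0, a > 2."""
    theta = np.abs(theta)
    return lam * ((theta <= lam)
                  + np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam)
                  * (theta > lam))

def lqa_scad_ls(X, y, lam, max_iter=100, eps=1e-8, tol=1e-6):
    """SCAD-penalized least squares via local quadratic approximation:
    near beta0, p_lam(|b|) is approximated by a quadratic with curvature
    p'_lam(|beta0|)/|beta0|, so each step solves a ridge-type system."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from OLS
    for _ in range(max_iter):
        w = scad_deriv(np.abs(beta), lam) / np.maximum(np.abs(beta), eps)
        beta_new = np.linalg.solve(X.T @ X / n + np.diag(w), X.T @ y / n)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-4] = 0.0               # set tiny coefficients to zero
    return beta
```

Coefficients whose current value is near zero receive a huge quadratic weight and are driven to zero, while large coefficients have SCAD derivative zero and are left unshrunk, which is exactly the sparsity/unbiasedness behavior discussed below.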

Oracle estimator
If an oracle assisted us in selecting variables, we would select only the variables with nonzero coefficients, apply the MLE to this submodel, and estimate the remaining coefficients as 0. This ideal estimator is called an oracle estimator.

A review of Fan and Li (2001)
For the finite-parameter case, Fan and Li (2001) established an oracle property: for certain penalty functions, such as SCAD and the hard thresholding penalty, penalized likelihood estimators are asymptotically as efficient as the ideal oracle estimator. They also proposed a sandwich formula for estimating the standard errors of the estimated nonzero coefficients and empirically verified its consistency. The paper laid important groundwork on variable selection, but its theoretical results are limited to the finite-parameter setting. While those results are encouraging, the fundamental problems with a growing number of parameters had not been addressed.

Objectives of this paper
The objectives of this paper are to establish the following asymptotic properties of the nonconcave penalized likelihood estimator.
(Oracle property.) Under certain conditions on the likelihood and penalty functions, if $p_n$ does not grow too fast, then for a proper choice of $\lambda_n$ there exists a penalized likelihood estimator such that $\hat\beta_{n2} = 0$ and $\hat\beta_{n1}$ behaves the same as if $\beta_{n2} = 0$ were known in advance.
(Asymptotic normality.) As the length of $\hat\beta_{n1}$ depends on $n$, we show that an arbitrary linear combination $A_n \hat\beta_{n1}$ is asymptotically normal, where $A_n$ is a $q \times s_n$ matrix for any finite $q$.
(Consistency of the sandwich formula.) Let $\hat\Sigma_n$ be the estimated covariance matrix of $\hat\beta_{n1}$; it is consistent in the sense that $A_n \hat\Sigma_n A_n^T$ converges to the asymptotic covariance matrix of $A_n \hat\beta_{n1}$.
(Likelihood ratio theory.) For testing the linear hypothesis $H_0: A_n \beta_{n1} = 0$, twice the difference of the maximized penalized likelihoods (the penalized likelihood ratio statistic) asymptotically follows a $\chi^2$ distribution.

Penalty function
Three principles for a good penalty function: unbiasedness, sparsity and continuity. Consider a simple form of (1.1):
$$\frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|).$$
The $L_2$ penalty $p_\lambda(|\theta|) = \lambda|\theta|^2$ (ridge regression) and the $L_q$ penalties with $q > 1$ reduce variability by shrinking the solutions, but do not produce sparsity.
The $L_1$ penalty $p_\lambda(|\theta|) = \lambda|\theta|$ (the soft thresholding rule) and the $L_q$ penalties with $q < 1$ yield sparse solutions, but the estimators are biased for large parameters.
The hard thresholding penalty $p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda)$ results in the hard thresholding rule $\hat\theta = z\, I(|z| > \lambda)$, but the estimator is not continuous in the data $z$.
The SCAD penalty satisfies all three properties; it is defined through its derivative
$$p'_\lambda(\theta) = \lambda \left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda) \right\}, \quad \theta > 0, \text{ for some } a > 2.$$
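For the scalar problem $\frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|)$, each penalty above has a closed-form minimizer, which makes the three principles concrete. A minimal sketch (the SCAD rule below is the standard closed form for a > 2; all names are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """L1 penalty: shrinks every estimate by lam, so biased for large |z|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    """Hard penalty: unbiased for large |z| but discontinuous at |z| = lam."""
    return z * (np.abs(z) > lam)

def scad_threshold(z, lam, a=3.7):
    """SCAD rule: sparse, continuous, and equal to z once |z| > a*lam."""
    az = np.abs(z)
    return np.where(az <= 2 * lam,
                    np.sign(z) * np.maximum(az - lam, 0.0),
                    np.where(az <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))

z = np.linspace(-6, 6, 5)
print(scad_threshold(z, lam=1.0))  # large |z| pass through unshrunk
```

Soft thresholding shifts every surviving estimate by $\lambda$ (bias for large $|z|$), hard thresholding jumps at $|z| = \lambda$ (discontinuity), while SCAD interpolates: it equals the soft rule for $|z| \le 2\lambda$ and leaves $|z| > a\lambda$ untouched.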

Regularity conditions on the penalty
Let $a_n = \max_{1 \le j \le p_n} \{ |p'_{\lambda_n}(|\beta_{n0j}|)| : \beta_{n0j} \ne 0 \}$ and $b_n = \max_{1 \le j \le p_n} \{ |p''_{\lambda_n}(|\beta_{n0j}|)| : \beta_{n0j} \ne 0 \}$. Then we place the following conditions on the penalty functions:
(A) $\liminf_{n \to \infty} \liminf_{\theta \to 0+} p'_{\lambda_n}(\theta)/\lambda_n > 0$;
(B) $a_n = O(n^{-1/2})$;
(B′) $a_n = o(1/\sqrt{np_n})$;
(C) $b_n \to 0$ as $n \to \infty$;
(C′) $b_n = o_p(1/\sqrt{p_n})$;
(D) there are constants $C$ and $D$ such that, when $\theta_1, \theta_2 > C\lambda_n$, $|p''_{\lambda_n}(\theta_1) - p''_{\lambda_n}(\theta_2)| \le D |\theta_1 - \theta_2|$.

Regularity conditions on the penalty: remarks
(A) makes the penalty function singular at the origin, so that the penalized likelihood estimator possesses the sparsity property.
(B) and (B′) ensure unbiasedness for large parameters and the existence of a root-$(n/p_n)$-consistent penalized likelihood estimator.
(C) and (C′) guarantee that the penalty function does not have much more influence than the likelihood on the penalized likelihood estimators.
(D) is a smoothness condition imposed on the nonconcave penalty functions.
Under (H), all of the above are satisfied by the SCAD and hard thresholding penalties, since $a_n = 0$ and $b_n = 0$ once $n$ is large enough.
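For concreteness, here is the one-line calculation behind the last sentence, written out as a math sketch using the SCAD derivative given earlier together with assumption (H):

```latex
% SCAD: p'_\lambda(\theta) = 0 for \theta > a\lambda, hence also p''_\lambda(\theta) = 0 there.
% Under (H), \min_{1 \le j \le s_n} |\beta_{n0j}|/\lambda_n \to \infty, so for all large n
% every nonzero coefficient satisfies |\beta_{n0j}| > a\lambda_n, giving
a_n = \max_{1 \le j \le s_n} \bigl| p'_{\lambda_n}(|\beta_{n0j}|) \bigr| = 0,
\qquad
b_n = \max_{1 \le j \le s_n} \bigl| p''_{\lambda_n}(|\beta_{n0j}|) \bigr| = 0 .
```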

Regularity conditions on the likelihood functions
Due to the diverging number of parameters, some conditions have to be strengthened, relative to the finite-parameter case, to keep uniform properties of the likelihood functions and sample series.
(E) For every $n$ the observations $\{V_{ni},\ i = 1, 2, \ldots, n\}$ are iid with density $f_n(V_{n1}, \beta_n)$, which has a common support, and the model is identifiable. Furthermore,
$$E_{\beta_n}\left\{ \frac{\partial \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nj}} \right\} = 0 \quad \text{for } j = 1, 2, \ldots, p_n \tag{1}$$
and
$$E_{\beta_n}\left\{ \frac{\partial \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nj}} \frac{\partial \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nk}} \right\} = -E_{\beta_n}\left\{ \frac{\partial^2 \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nj}\, \partial \beta_{nk}} \right\}. \tag{2}$$
(F) The Fisher information matrix
$$I_n(\beta_n) = E\left[ \left\{ \frac{\partial \log f_n(V_{n1}, \beta_n)}{\partial \beta_n} \right\} \left\{ \frac{\partial \log f_n(V_{n1}, \beta_n)}{\partial \beta_n} \right\}^T \right] \tag{3}$$

Regularity conditions on the likelihood functions (cont.)
satisfies
$$0 < C_1 < \lambda_{\min}\{I_n(\beta_n)\} \le \lambda_{\max}\{I_n(\beta_n)\} < C_2 < \infty \quad \text{for all } n,$$
and, for $j, k = 1, 2, \ldots, p_n$,
$$E_{\beta_n}\left\{ \frac{\partial \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nj}} \frac{\partial \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nk}} \right\}^2 < C_3 < \infty \tag{4}$$
and
$$E_{\beta_n}\left\{ \frac{\partial^2 \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nj}\, \partial \beta_{nk}} \right\}^2 < C_4 < \infty. \tag{5}$$
(G) There is a large enough open subset $\omega_n$ of $\Omega_n \subset \mathbb{R}^{p_n}$ containing the true parameter point $\beta_n$ such that, for almost all $V_{ni}$, the density admits all third derivatives for all $\beta_n \in \omega_n$. Furthermore, there are functions $M_{njkl}$ such that
$$\left| \frac{\partial^3 \log f_n(V_{n1}, \beta_n)}{\partial \beta_{nj}\, \partial \beta_{nk}\, \partial \beta_{nl}} \right| \le M_{njkl}(V_{n1}) \quad \text{for all } \beta_n \in \omega_n.$$
(H) Let $\beta_{n01}, \beta_{n02}, \ldots, \beta_{n0s_n}$ be nonzero and $\beta_{n0(s_n+1)}, \ldots, \beta_{n0p_n}$ be zero. Then $\min_{1 \le j \le s_n} |\beta_{n0j}| / \lambda_n \to \infty$ as $n \to \infty$.

Regularity conditions on the likelihood functions: remarks
Conditions (F) and (G) impose second- and fourth-moment bounds on the likelihood function.
(H) is necessary for the oracle property. It gives the rate at which the penalized likelihood can distinguish nonvanishing parameters from 0. The zero-component part can be relaxed to $\max_{s_n+1 \le j \le p_n} |\beta_{n0j}| / \lambda_n \to 0$ as $n \to \infty$.

Oracle properties
Recall that $V_{ni}$, $i = 1, \ldots, n$, are iid random vectors with density $f_n(V_{n1}, \beta_{n0})$. Let
$$L_n(\beta_n) = \sum_{i=1}^{n} \log f_n(V_{ni}, \beta_n)$$
be the log-likelihood function, and let
$$Q_n(\beta_n) = L_n(\beta_n) - n \sum_{j=1}^{p_n} p_{\lambda_n}(|\beta_{nj}|)$$
be the penalized likelihood function.

Oracle properties
Theorem 1. Suppose that the density $f_n(V_{n1}, \beta_{n0})$ satisfies conditions (E)–(G) and the penalty function $p_{\lambda_n}(\cdot)$ satisfies conditions (B)–(D). If $p_n^4/n \to 0$ as $n \to \infty$, then there is a local maximizer $\hat\beta_n$ of $Q_n(\beta_n)$ such that $\|\hat\beta_n - \beta_{n0}\| = O_p\{\sqrt{p_n}\,(n^{-1/2} + a_n)\}$.
Remark. If $a_n$ satisfies condition (B), that is, $a_n = O(n^{-1/2})$, then there is a root-$(n/p_n)$-consistent estimator. For the SCAD or hard thresholding penalty, if condition (H) is satisfied by the model, then $a_n = 0$ once $n$ is large enough: the root-$(n/p_n)$-consistent penalized likelihood estimator exists with probability tending to 1, and no requirement is imposed on the convergence rate of $\lambda_n$.

Oracle properties
Proof. Let $\alpha_n = \sqrt{p_n}\,(n^{-1/2} + a_n)$ and set $\|u\| = C$, where $C$ is a large constant. Our aim is to show that for any given $\epsilon > 0$ there is a large constant $C$ such that, for large $n$,
$$P\Big\{ \sup_{\|u\| = C} Q_n(\beta_{n0} + \alpha_n u) < Q_n(\beta_{n0}) \Big\} \ge 1 - \epsilon.$$
This implies that with probability tending to 1 there is a local maximizer $\hat\beta_n$ in the ball $\{\beta_{n0} + \alpha_n u : \|u\| \le C\}$ such that $\|\hat\beta_n - \beta_{n0}\| = O_p(\alpha_n)$. Using $p_{\lambda_n}(0) = 0$, we have
$$D_n(u) = Q_n(\beta_{n0} + \alpha_n u) - Q_n(\beta_{n0}) \le L_n(\beta_{n0} + \alpha_n u) - L_n(\beta_{n0}) - n \sum_{j=1}^{s_n} \big\{ p_{\lambda_n}(|\beta_{n0j} + \alpha_n u_j|) - p_{\lambda_n}(|\beta_{n0j}|) \big\} = (I) + (II).$$

Oracle properties
Proof (cont.). By Taylor expansion we obtain
$$(I) = \alpha_n \nabla^T L_n(\beta_{n0})\, u + \tfrac{1}{2} u^T \nabla^2 L_n(\beta_{n0})\, u\, \alpha_n^2 + \tfrac{1}{6} \nabla^T \{ u^T \nabla^2 L_n(\beta_n^*)\, u \}\, u\, \alpha_n^3 = I_1 + I_2 + I_3,$$
where the vector $\beta_n^*$ lies between $\beta_{n0}$ and $\beta_{n0} + \alpha_n u$, and
$$(II) = -\sum_{j=1}^{s_n} \big[ n \alpha_n p'_{\lambda_n}(|\beta_{n0j}|)\, \mathrm{sgn}(\beta_{n0j})\, u_j + n \alpha_n^2 p''_{\lambda_n}(|\beta_{n0j}|)\, u_j^2 \{1 + o(1)\} \big] = I_4 + I_5.$$
One can show that the terms $I_1$, $I_3$, $I_4$ and $I_5$ are all dominated by
$$I_2 = -\frac{n \alpha_n^2}{2} u^T I_n(\beta_{n0})\, u + o_p(1)\, n \alpha_n^2 \|u\|^2,$$
which is negative. This completes the proof of Theorem 1. ∎

Oracle properties
Lemma 1. Assume that conditions (A) and (E)–(H) are satisfied. If $\lambda_n \to 0$, $\sqrt{n/p_n}\, \lambda_n \to \infty$ and $p_n^5/n \to 0$ as $n \to \infty$, then with probability tending to 1, for any given $\beta_{n1}$ satisfying $\|\beta_{n1} - \beta_{n01}\| = O_p(\sqrt{p_n/n})$ and any constant $C$,
$$Q_n\{(\beta_{n1}^T, 0)^T\} = \max_{\|\beta_{n2}\| \le C (p_n/n)^{1/2}} Q_n\{(\beta_{n1}^T, \beta_{n2}^T)^T\}.$$

Oracle properties
Denote
$$\Sigma_{\lambda_n} = \mathrm{diag}\{ p''_{\lambda_n}(|\beta_{n01}|), \ldots, p''_{\lambda_n}(|\beta_{n0s_n}|) \}$$
and
$$b_n = \big( p'_{\lambda_n}(|\beta_{n01}|)\, \mathrm{sgn}(\beta_{n01}), \ldots, p'_{\lambda_n}(|\beta_{n0s_n}|)\, \mathrm{sgn}(\beta_{n0s_n}) \big)^T.$$
Theorem 2. Suppose conditions (A)–(H) are satisfied. If $\lambda_n \to 0$, $\sqrt{n/p_n}\, \lambda_n \to \infty$ and $p_n^5/n \to 0$ as $n \to \infty$, then with probability tending to 1, the root-$(n/p_n)$-consistent local maximizer $\hat\beta_n = (\hat\beta_{n1}^T, \hat\beta_{n2}^T)^T$ in Theorem 1 must satisfy:
(i) Sparsity: $\hat\beta_{n2} = 0$.
(ii) Asymptotic normality:
$$\sqrt{n}\, A_n I_n^{-1/2}(\beta_{n01}) \{ I_n(\beta_{n01}) + \Sigma_{\lambda_n} \} \big[ \hat\beta_{n1} - \beta_{n01} + \{ I_n(\beta_{n01}) + \Sigma_{\lambda_n} \}^{-1} b_n \big] \xrightarrow{D} N(0, G),$$
where $A_n$ is a $q \times s_n$ matrix such that $A_n A_n^T \to G$, a $q \times q$ nonnegative symmetric matrix.

Oracle properties
Proof. As shown in Theorem 1, there is a root-$(n/p_n)$-consistent local maximizer $\hat\beta_n$ of $Q_n(\beta_n)$, and by Lemma 1, part (i) holds: $\hat\beta_n$ has the form $(\hat\beta_{n1}^T, 0)^T$. We need only prove (ii), the asymptotic normality of the penalized nonconcave likelihood estimator $\hat\beta_{n1}$. If we can show that
$$\{ I_n(\beta_{n01}) + \Sigma_{\lambda_n} \} (\hat\beta_{n1} - \beta_{n01}) + b_n = \frac{1}{n} \nabla L_n(\beta_{n01}) + o_p(n^{-1/2}),$$
then
$$\sqrt{n}\, A_n I_n^{-1/2}(\beta_{n01}) \{ I_n(\beta_{n01}) + \Sigma_{\lambda_n} \} \big[ \hat\beta_{n1} - \beta_{n01} + \{ I_n(\beta_{n01}) + \Sigma_{\lambda_n} \}^{-1} b_n \big] = \frac{1}{\sqrt{n}} A_n I_n^{-1/2}(\beta_{n01}) \nabla L_n(\beta_{n01}) + o_p\{ A_n I_n^{-1/2}(\beta_{n01}) \}.$$
Under the conditions of Theorem 2, the last term is $o_p(1)$.

Oracle properties
Proof (cont.). Let
$$Y_{ni} = \frac{1}{\sqrt{n}} A_n I_n^{-1/2}(\beta_{n01}) \nabla L_{ni}(\beta_{n01}), \quad i = 1, 2, \ldots, n.$$
It follows that, for any $\epsilon > 0$,
$$\sum_{i=1}^{n} E \|Y_{ni}\|^2\, 1\{ \|Y_{ni}\| > \epsilon \} \le n \{ E \|Y_{n1}\|^4 \}^{1/2} \{ P(\|Y_{n1}\| > \epsilon) \}^{1/2}.$$
One can show that $P(\|Y_{n1}\| > \epsilon) = O(n^{-1})$ and $E\|Y_{n1}\|^4 = O(p_n^2/n^2)$. Thus
$$\sum_{i=1}^{n} E \|Y_{ni}\|^2\, 1\{ \|Y_{ni}\| > \epsilon \} = O\Big( n \cdot \frac{p_n}{n} \cdot n^{-1/2} \Big) = O(p_n/\sqrt{n}) = o(1),$$
since $p_n^5/n \to 0$ implies $p_n^2/n \to 0$. On the other hand, $\sum_{i=1}^{n} \mathrm{cov}(Y_{ni}) \to G$. Thus the $Y_{ni}$ satisfy the conditions of the Lindeberg–Feller CLT, which means that $\frac{1}{\sqrt{n}} A_n I_n^{-1/2}(\beta_{n01}) \nabla L_n(\beta_{n01})$ has an asymptotic multivariate normal distribution. ∎

Oracle properties
Remark. By Theorem 2, sparsity and asymptotic normality remain valid when the number of parameters diverges; the oracle property holds for the SCAD and hard thresholding penalties. When $n$ is large enough, $\Sigma_{\lambda_n} = 0$ and $b_n = 0$ for these penalties, and Theorem 2(ii) becomes
$$\sqrt{n}\, A_n I_n^{1/2}(\beta_{n01}) (\hat\beta_{n1} - \beta_{n01}) \xrightarrow{D} N(0, G).$$

Estimation of the covariance matrix
As in Fan and Li (2001), the sandwich formula gives the estimated covariance matrix of $\hat\beta_{n1}$:
$$\hat\Sigma_n = n \{ \nabla^2 L_n(\hat\beta_{n1}) - n \Sigma_{\lambda_n}(\hat\beta_{n1}) \}^{-1}\, \widehat{\mathrm{cov}}\{ \nabla L_n(\hat\beta_{n1}) \}\, \{ \nabla^2 L_n(\hat\beta_{n1}) - n \Sigma_{\lambda_n}(\hat\beta_{n1}) \}^{-1},$$
where
$$\widehat{\mathrm{cov}}\{ \nabla L_n(\hat\beta_{n1}) \} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_{ni}(\hat\beta_{n1})}{\partial \beta_j} \frac{\partial L_{ni}(\hat\beta_{n1})}{\partial \beta_k} - \left\{ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_{ni}(\hat\beta_{n1})}{\partial \beta_j} \right\} \left\{ \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_{ni}(\hat\beta_{n1})}{\partial \beta_k} \right\}.$$
Denote by
$$\Sigma_n = \{ I_n(\beta_{n01}) + \Sigma_{\lambda_n}(\beta_{n01}) \}^{-1} I_n(\beta_{n01}) \{ I_n(\beta_{n01}) + \Sigma_{\lambda_n}(\beta_{n01}) \}^{-1}$$
the asymptotic variance of $\hat\beta_{n1}$ in Theorem 2(ii).
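A minimal numerical sketch of the sandwich estimate above, assuming the per-observation score vectors $\nabla L_{ni}(\hat\beta_{n1})$ and the Hessian $\nabla^2 L_n(\hat\beta_{n1})$ have already been computed for the selected submodel; all names are illustrative.

```python
import numpy as np

def sandwich_cov(scores, hessian, Sigma_lam):
    """Sandwich estimate of cov(beta_hat_n1), following the display above.

    scores    : (n, s) array, i-th row = gradient of L_ni at beta_hat_n1
    hessian   : (s, s) array, Hessian of L_n at beta_hat_n1
    Sigma_lam : (s, s) diagonal matrix of penalty second derivatives
    """
    n = scores.shape[0]
    # cov-hat of the score: empirical second moment minus outer product of means
    mean_score = scores.mean(axis=0)
    cov_score = scores.T @ scores / n - np.outer(mean_score, mean_score)
    bread = np.linalg.inv(hessian - n * Sigma_lam)
    return n * bread @ cov_score @ bread
```

For the SCAD penalty with all selected coefficients large, Sigma_lam is eventually the zero matrix and the formula reduces to the usual MLE sandwich estimator.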

Consistency of the sandwich formula
Theorem 3. If conditions (A)–(H) are satisfied and $p_n^5/n \to 0$ as $n \to \infty$, then
$$A_n \hat\Sigma_n A_n^T - A_n \Sigma_n A_n^T \xrightarrow{p} 0 \tag{6}$$
for any $q \times s_n$ matrix $A_n$ such that $A_n A_n^T = G$, where $q$ is any fixed integer.
Theorem 3 not only proves a conjecture of Fan and Li (2001) about the consistency of the sandwich formula for the standard error matrix, but also extends the result to the situation with a growing number of parameters. It offers a way to construct confidence intervals for the estimates of the parameters.

Likelihood ratio test
Consider testing the linear hypothesis
$$H_0: A_n \beta_{n01} = 0 \quad \text{vs.} \quad H_1: A_n \beta_{n01} \ne 0,$$
where $A_n$ is a $q \times s_n$ matrix with $A_n A_n^T = I_q$ for a fixed $q$. A natural likelihood ratio test statistic for this problem is
$$T_n = 2 \Big\{ \sup_{\Omega_n} Q_n(\beta_n) - \sup_{\Omega_n,\, A_n \beta_{n1} = 0} Q_n(\beta_n) \Big\}.$$
Theorem 4. When conditions (A)–(H), (B′) and (C′) are satisfied, under $H_0$ we have
$$T_n \xrightarrow{D} \chi^2_q \tag{7}$$
provided that $p_n^5/n \to 0$ as $n \to \infty$.
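In practice Theorem 4 would be used as follows: fit the penalized likelihood with and without the constraint $A_n \beta_{n1} = 0$, form $T_n$, and compare with a $\chi^2_q$ critical value. A minimal sketch, with the two maximized values assumed to come from an optimizer of the reader's choice:

```python
from scipy.stats import chi2

def penalized_lrt(Q_unrestricted, Q_restricted, q, alpha=0.05):
    """Penalized likelihood ratio test of H0: A_n beta_n1 = 0 (Theorem 4).

    Q_unrestricted : sup of Q_n over the full parameter space
    Q_restricted   : sup of Q_n subject to A_n beta_n1 = 0
    q              : number of rows of A_n (A_n A_n^T = I_q)
    """
    T_n = 2.0 * (Q_unrestricted - Q_restricted)
    crit = chi2.ppf(1 - alpha, df=q)  # asymptotic chi^2_q null distribution
    return T_n, T_n > crit
```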

Simulation study
Consider the autoregressive model
$$X_i = \beta_1 X_{i-1} + \beta_2 X_{i-2} + \cdots + \beta_{p_n} X_{i-p_n} + \epsilon_i, \quad i = 1, 2, \ldots, n,$$
where $\beta = (11/4, -23/6, 37/12, -13/9, 1/3, 0, \ldots, 0)^T$ and $\epsilon_i$ is white noise with variance $\sigma^2$. In the simulation experiments, 400 samples of sizes 100, 200, 400 and 800 with $p_n = \lfloor 4 n^{1/4} \rfloor - 5$ are drawn from this model. The SCAD penalty is employed. The results are summarized in the tables below.
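A minimal sketch of the data-generating step, using the coefficients and the $p_n = \lfloor 4n^{1/4} \rfloor - 5$ rule as reconstructed above; the burn-in length and seed are arbitrary choices, and the fitting step would reuse `lqa_scad_ls` from the earlier sketch.

```python
import numpy as np

def simulate_ar(n, sigma=1.0, seed=0, burn_in=200):
    """One path from X_i = beta_1 X_{i-1} + ... + beta_{p_n} X_{i-p_n} + eps_i,
    returned in regression form (lag matrix X, response y)."""
    rng = np.random.default_rng(seed)
    p_n = int(4 * n ** 0.25) - 5
    beta = np.zeros(p_n)
    beta[:5] = [11 / 4, -23 / 6, 37 / 12, -13 / 9, 1 / 3]
    x = np.zeros(burn_in + n)
    for i in range(p_n, burn_in + n):
        lags = x[i - p_n:i][::-1]                 # x_{i-1}, ..., x_{i-p_n}
        x[i] = lags @ beta + sigma * rng.standard_normal()
    x = x[burn_in:]                               # discard burn-in transient
    y = x[p_n:]
    X = np.column_stack([x[p_n - j: n - j] for j in range(1, p_n + 1)])
    return X, y, beta

X, y, beta_true = simulate_ar(n=400)
# beta_hat = lqa_scad_ls(X, y, lam=0.3)          # lambda would be tuned, e.g. by CV
```

With these coefficients the characteristic polynomial factors as $(\lambda - 3/4)(\lambda^2 - \lambda + 2/3)^2$, so all roots lie inside the unit circle and the process is stationary, which is what justifies the burn-in.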

Simulation Results
[Tables of simulation results for the four sample sizes appear here in the original slides; they did not survive the transcription.]

Summary
In most model selection problems the number of parameters should be large and grow with the sample size. This paper establishes asymptotic properties of the nonconcave penalized likelihood estimator for situations in which the number of parameters tends to $\infty$ as the sample size increases. Under regularity conditions, an oracle property and asymptotic normality of the penalized likelihood estimator are established; consistency of the sandwich formula for the covariance matrix is demonstrated; and nonconcave penalized likelihood ratio statistics are shown to follow an asymptotic $\chi^2$ distribution.

Thank You!