On the Oracle Property of the Adaptive LASSO in Stationary and Nonstationary Autoregressions. Anders Bredahl Kock. CREATES Research Paper


On the Oracle Property of the Adaptive LASSO in Stationary and Nonstationary Autoregressions

Anders Bredahl Kock

CREATES Research Paper 2012-05

Department of Economics and Business, Aarhus University, Bartholins Allé 10, DK-8000 Aarhus C, Denmark. oekonomi@au.dk. Tel:

ON THE ORACLE PROPERTY OF THE ADAPTIVE LASSO IN STATIONARY AND NONSTATIONARY AUTOREGRESSIONS

ANDERS BREDAHL KOCK, AARHUS UNIVERSITY AND CREATES

Abstract. We show that the Adaptive LASSO is oracle efficient in stationary and nonstationary autoregressions. This means that it estimates parameters consistently, selects the correct sparsity pattern, and estimates the coefficients belonging to the relevant variables at the same asymptotic efficiency as if only these had been included in the model from the outset. In particular this implies that it is able to discriminate between stationary and nonstationary autoregressions, and it thereby constitutes an addition to the set of unit root tests. However, it is also shown that the Adaptive LASSO has no power against shrinking alternatives of the form c/T, where c is a constant and T the sample size, if it is tuned to perform consistent model selection. We show that if the Adaptive LASSO is instead tuned to perform conservative model selection, it has power even against shrinking alternatives of this form. Monte Carlo experiments reveal that the Adaptive LASSO performs particularly well in the presence of a unit root while being on par with its competitors in the stationary setting.

Keywords: Adaptive LASSO, oracle efficiency, consistent model selection, conservative model selection, autoregression, shrinkage.
AMS 2000 classification: 62F07, 62F10, 62F12, 62J07. JEL classification: C13, C22.

1. Introduction

Variable selection in high-dimensional systems has received a lot of attention in the statistics literature in the recent 10-15 years or so, and it is also becoming increasingly popular in econometrics. As traditional model selection methods are computationally infeasible if the number of covariates is large, focus has been on penalized or shrinkage-type estimators, of which the most famous is probably the LASSO of Tibshirani (1996). This paper sparked a flurry of research into the theoretical properties of LASSO-type estimators, among the first of which was Knight and Fu (2000). Subsequently, many other shrinkage estimators have been analyzed: the SCAD of Fan and Li (2001), the Bridge and Marginal Bridge estimators in Huang et al. (2008), the Dantzig selector of Candes and Tao (2007) and the Sure Independence Screening of Fan and Lv (2008), to mention just a few. For a recent review with particular emphasis on the LASSO see Bühlmann and Van De Geer (2011). The focus of these papers is to establish the so-called oracle property for the proposed estimators. This entails showing that the estimators are consistent, perform correct variable selection, and that the limiting distribution of the estimators of the non-zero coefficients is the same as if only the

Date: February 2012. Part of this research was carried out while I was visiting the Australian National University. I wish to thank Tue Gørgens and the Research School of Economics for inviting me and creating a pleasant environment. I am also indebted to Svend Erik Graversen and Jørgen Hoffmann-Jørgensen for help. I wish to thank CREATES, funded by the Danish National Research Foundation, for providing financial support.

relevant variables had been included in the model. Put differently, the inference is as efficient as if an oracle had revealed the true model to us and estimation had been carried out using only the relevant variables.

Most focus in the statistics literature has been on establishing the oracle property for cross-sectional data. An exception is Wang et al. (2007), who consider the LASSO for stationary autoregressions, while Kock (2012) has shown that the oracle efficiency of the Bridge and Marginal Bridge estimators carries over to linear random and fixed effects panel data settings. In this paper we show that the Adaptive LASSO of Zou (2006) possesses the oracle property in stationary as well as nonstationary autoregressions. We focus on the Adaptive LASSO since the original LASSO is only oracle efficient under rather restrictive assumptions which exclude too high dependence between the covariates, an assumption which is unlikely to be satisfied in time series models.

We shall consider a model of the form

(1)    Δy_t = ρ y_{t−1} + Σ_{j=1}^p β_j Δy_{t−j} + ε_t,

which is sometimes called a Dickey-Fuller regression. ε_t is the error term, to be discussed further in the next section. (1) is said to have a unit root if ρ = 0. When testing for a unit root, one usually first determines the number of lagged differences p to be included. This can be done either by information criteria, or modifications hereof, as in Ng and Perron (2001). Having selected the lags, one tests whether ρ = 0. The oracle efficient estimators create new possibilities for carrying out such tests, since testing for a unit root is basically a variable selection problem: is y_{t−1} to be left out of the model (ρ = 0) or not? Hence, establishing the oracle property for the Adaptive LASSO means that we can choose the number of lagged differences to be included (leaving out irrelevant intermediate lags) and test for a unit root at the same time. Knight and Fu (2011) made this point and have used it to construct a unit root test based on the Bridge estimator in the setting we shall call conservative model selection. We show:

(i) The Adaptive LASSO possesses the oracle property in stationary and nonstationary autoregressions.
(ii) We carry out a detailed finite sample and local analysis in the stationary, nonstationary and local-to-unity settings. The local-to-unity setting reveals that the Adaptive LASSO is not exempt from the critique by Leeb and Pötscher (2005, 2008) of consistent model selection techniques.
(iii) This problem, due to nonuniformity in the asymptotics, can be alleviated if one is willing to tune the Adaptive LASSO to perform conservative model selection instead of consistent model selection. The properties of conservative model selection are investigated in the stationary as well as the nonstationary setting.

The plan of the paper is as follows. Section 2 introduces the Adaptive LASSO and some notation. Section 3 states the oracle theorems for the Adaptive LASSO for stationary and nonstationary autoregressions, while Section 4 carries out a detailed finite sample and local analysis under various settings. Section 5 considers the properties of the Adaptive LASSO when tuned to perform conservative model selection, Section 6 contains some Monte Carlo experiments and Section 7 concludes. All proofs are deferred to the Appendix. We make mathematically precise definitions of consistent and conservative model selection in Section 2.

2. Setup and Notation

The most famous shrinkage estimator is without doubt the LASSO (the Least Absolute Shrinkage and Selection Operator). Its popularity is due to the fact that it carries out variable selection and parameter estimation in one step. However, it has been shown that the LASSO is only oracle efficient under rather strict conditions, see Meinshausen and Bühlmann (2006), Zhao and Yu (2006) and Zou (2006), which do not allow too high correlations between covariates. This motivates using other procedures for variable selection than the LASSO. In particular, the problem of the LASSO is that it penalizes all parameters equally. Hence, Zou (2006) proposed the Adaptive LASSO, which applies more intelligent, data-driven penalization, and proves that it is oracle efficient in a fixed regressor setting. In our context the Adaptive LASSO is defined as the minimizer of the following objective function:

(2)    Ψ_T(ρ, β) = Σ_{t=1}^T ( Δy_t − ρ y_{t−1} − Σ_{j=1}^p β_j Δy_{t−j} )² + λ_T w_1^{γ₁} |ρ| + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} |β_j|,    γ₁, γ₂ > 0,

where w_1 = 1/|ρ̂_I| and w_{2,j} = 1/|β̂_{I,j}| for j = 1, ..., p, and ρ̂_I and β̂_{I,j} denote some initial estimators of the parameters in (1). We shall use the least squares estimator in this paper, but other estimators can be used as well. Hence, the Adaptive LASSO minimizes the least squares objective function plus a penalty term which penalizes parameters that are different from 0. Due to this extra penalty term the minimizers of (2) are shrunk towards zero compared to the least squares estimator, hence the name shrinkage estimator. The size of the shrinkage depends on the penalty term, which in turn depends on the initial least squares estimates: the smaller the initial estimate, the larger the penalty and the more likely it is that the Adaptive LASSO shrinks the parameter exactly to zero. The size of the penalty also depends on the sequence λ_T, which must be chosen in an appropriate manner in order to obtain oracle efficiency. In particular, λ_T must grow fast enough to shrink the estimates of truly zero parameters to zero, but slowly enough in order not to introduce asymptotic bias in the estimators of the non-zero coefficients. The details are given in Section 3.

In this paper we do not include deterministic components such as constants and trends, in order to focus on the main idea of consistent and conservative model selection in stationary and nonstationary autoregressions. However, deterministics could be handled using standard detrending ideas, see e.g. Hamilton (1994).

2.1. Notation. We shall consider T + p + 1 observations from a time series y_t generated by (1). A = { 1 ≤ j ≤ p : β_j ≠ 0 } denotes the active set of lagged differences, i.e. those lagged differences with non-zero coefficients. Let z_t = (Δy_{t−1}, ..., Δy_{t−p})' be the p × 1 vector of lagged differences and let x_t = (y_{t−1}, z_t')' denote the vector of all covariates. Let Σ = E(z_t z_t') and let Σ_A denote the matrix that has picked out all elements in columns and rows indexed by A. So if p = 5 and A = {2, 3, 4}, Σ_A equals the 3 × 3 matrix that has picked out rows and columns 2, 3 and 4 of the 5 × 5 matrix Σ. Similarly, A indexes vectors by picking out the elements with index in A. (Of course the actual value of this expectation depends on whether y_t is stationary or not. However, irrespective of this, z_t is stationary.)
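Purely as an illustration of how the minimizer of (2) can be computed, the following is a minimal sketch of the Adaptive LASSO for the Dickey-Fuller regression, using least squares initial estimates and a plain coordinate descent loop. The function name, the use of coordinate descent rather than the algorithm of Zou (2006) employed in Section 6, and the default settings are choices made here for concreteness, not specifications taken from the paper.

import numpy as np

def adaptive_lasso_adf(y, p, lam, gamma1=1.0, gamma2=1.0, n_iter=500):
    """Adaptive LASSO for the Dickey-Fuller regression (2) via coordinate descent.

    y   : observed series of length T + p + 1
    p   : number of lagged differences
    lam : penalty parameter lambda_T
    Returns (rho_hat, beta_hat).
    """
    dy = np.diff(y)
    # Regress Delta y_t on y_{t-1} and Delta y_{t-1}, ..., Delta y_{t-p}
    Y = dy[p:]
    X = np.column_stack([y[p:-1]] + [dy[p - j:-j] for j in range(1, p + 1)])

    # Initial least squares estimates used to build the adaptive weights
    b_init = np.linalg.lstsq(X, Y, rcond=None)[0]
    gammas = np.array([gamma1] + [gamma2] * p)
    w = 1.0 / np.abs(b_init) ** gammas        # w_1 = 1/|rho_I|^g1, w_2j = 1/|beta_Ij|^g2

    # Coordinate descent on ||Y - X b||^2 + lam * sum_k w_k |b_k|
    b = b_init.copy()
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(p + 1):
            r_k = Y - X @ b + X[:, k] * b[k]  # partial residual excluding coordinate k
            z = X[:, k] @ r_k
            thr = lam * w[k] / 2.0            # soft-threshold level for coordinate k
            b[k] = np.sign(z) * max(abs(z) - thr, 0.0) / col_ss[k]
    return b[0], b[1:]

In practice λ_T (and γ₁, γ₂) would be chosen by an information criterion such as BIC, as is done in the Monte Carlo experiments of Section 6.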

Let Δy = (Δy_1, ..., Δy_T)', y_{−1} = (y_0, ..., y_{T−1})' and Δy_{−j} = (Δy_{1−j}, ..., Δy_{T−j})' for j = 1, ..., p. (The dependence on T is suppressed for some of the quantities where no confusion arises.) Let X = (y_{−1}, Δy_{−1}, ..., Δy_{−p}) be the T × (p+1) matrix of covariates and ε = (ε_1, ..., ε_T)' the vector of error terms. Let Z be a p × 1 vector such that Z ∼ N_p(0, σ²Σ), where σ² = E(ε_t²). Furthermore, (W(r))_{r=0}^{1} denotes the standard Wiener process on [0, 1]. S_T = diag(T, √T, ..., √T) denotes a (p+1) × (p+1) scaling matrix, ⇒ denotes weak convergence (convergence in law) and →_p convergence in probability. Let (ρ̂, β̂) denote the minimizer of (2). Of course (ρ̂, β̂) depends on T, but to keep notation simple we suppress this in the sequel. Where no confusion arises this is also done for other quantities. For any x ∈ R^n, ||x||_{ℓ₂} = (Σ_{i=1}^n x_i²)^{1/2} denotes the standard Euclidean ℓ₂ norm stemming from the inner product <x, x> = Σ_{i=1}^n x_i². Letting M_0 denote the true model and M̂ the estimated one, we shall say that a procedure is consistent if for all (ρ, β), P(M̂ = M_0) → 1. A procedure is said to be conservative if for all (ρ, β), P(M_0 ⊆ M̂) → 1, i.e. the probability of excluding relevant variables tends to zero.

3. Oracle results

This section establishes and discusses the oracle property of the Adaptive LASSO for stationary as well as nonstationary autoregressions. The results open the possibility of using the Adaptive LASSO to distinguish between these two, and hence the Adaptive LASSO can also be seen as a new way of testing for unit roots. We begin with the nonstationary case.

Theorem 1 (Consistent model selection under nonstationarity). Assume that ρ = 0 and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Then, if λ_T/√T → 0, λ_T T^{γ₁−1} → ∞ and λ_T T^{γ₂/2−1/2} → ∞,

1. Consistency: ||S_T[(ρ̂, β̂')' − (0, β')']||_{ℓ₂} ∈ O_p(1).
2. Oracle (i): P(ρ̂ = 0) → 1 and P(β̂_{A^c} = 0) → 1.
3. Oracle (ii): √T(β̂_A − β_A) ⇒ N(0, σ²[Σ_A]^{−1}).

λ_T T^{γ₁−1} → ∞ enables us to set ρ̂ = 0 with probability tending to one if ρ = 0. Likewise, λ_T T^{γ₂/2−1/2} → ∞ is needed to shrink the estimates of truly zero β_j's to zero. Notice that both conditions require λ_T to grow sufficiently fast, i.e. the size of the penalty term must be sufficiently large to shrink the estimates of the zero parameters to zero. λ_T/√T → 0, on the other hand, tells us that λ_T cannot grow too fast, for if λ_T grows too fast even non-zero parameters will be shrunk to zero. In order for all three conditions to be satisfied simultaneously we need γ₁ > 1/2 and γ₂ > 0. It is of interest that the requirements on γ₁ and γ₂ are not the same. The reason for this difference is that ρ̂_I converges at a rate of 1/T while β̂_{I,j} converges at a rate of 1/√T.
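As a simple check of how the three rate conditions interact, the following display works through one admissible tuning choice; the particular values are illustrative and not taken from the paper.

% One admissible tuning choice, purely as an illustration:
% take gamma_1 = gamma_2 = 1 and lambda_T = T^{1/4}. Then
\[
\frac{\lambda_T}{\sqrt{T}} = T^{-1/4} \longrightarrow 0,
\qquad
\lambda_T\,T^{\gamma_1 - 1} = T^{1/4} \longrightarrow \infty,
\qquad
\lambda_T\,T^{\gamma_2/2 - 1/2} = T^{1/4} \longrightarrow \infty,
\]
% so all three conditions hold simultaneously, which is possible precisely because
% gamma_1 > 1/2 and gamma_2 > 0; the same choice also satisfies the (weaker)
% conditions of Theorem 2 below.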

Theorem 1 states that ρ̂ and β̂ are estimated consistently at rates 1/T and 1/√T, respectively. Furthermore, the estimators of the zero coefficients do not only converge to zero in probability, they are set exactly equal to zero with probability tending to one. Hence, the Adaptive LASSO performs variable selection and consistent estimation: the correct sparsity pattern is detected asymptotically. Finally, the asymptotic distribution of the nonzero β_j's is the same as if the true model had been known and only the relevant variables (those with nonzero coefficients) had been included and least squares applied to that model. In other words, the Adaptive LASSO possesses the oracle property: it sets all parameters that are zero exactly equal to zero, and the asymptotic distribution of the estimators of the non-zero coefficients is the same as if only the relevant variables had been included in the model. So the Adaptive LASSO performs as well as if an oracle had revealed the correct sparsity pattern prior to estimation. This sounds too good to be true, and in some sense it is, as we shall see in Section 4.

The assumption that ε_t is i.i.d. can be relaxed as long as S_T^{−1} X'X S_T^{−1} and S_T^{−1} X'ε converge weakly. Of course the limits might change, but since we establish that the Adaptive LASSO is asymptotically equivalent to the least squares estimator including only the relevant variables, we would still conclude that it performs as well as if the true sparsity pattern had been known.

Next, we consider the Adaptive LASSO for stationary autoregressions.

Theorem 2 (Consistent model selection under stationarity). Assume that ρ ∈ (−2, 0) and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Then, if λ_T/√T → 0 and λ_T T^{γ₂/2−1/2} → ∞,

1. Consistency: √T ||[(ρ̂, β̂')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1).
2. Oracle (i): P(ρ̂ = 0) → 0 and P(β̂_{A^c} = 0) → 1.
3. Oracle (ii): √T [(ρ̂ − ρ), (β̂_A − β_A)']' ⇒ N(0, σ²[Q_{(1,A+1)}]^{−1}), where Q = E(x_t x_t') is of dimension (p+1) × (p+1). (In accordance with previous notation, Q_{(1,A+1)} is the matrix consisting of all rows and columns of Q with indexes in the set {1} ∪ (A+1), where the addition is to be understood elementwise.)

Since the assumptions on λ_T are a subset of those made in Theorem 1, the resulting requirements on γ₁ and γ₂ are a fortiori satisfied. The i.i.d. assumption on ε_t can be relaxed as in the nonstationary setting as long as (1/T) X'X converges in probability and (1/√T) X'ε converges weakly. As in Theorem 1 the Adaptive LASSO gives consistent parameter estimates, and as usual for stationary autoregressions the rate of convergence of ρ̂ is slowed down to the standard 1/√T rate compared to 1/T in the nonstationary case. The probability of falsely classifying ρ̂ = 0 tends to 0. As in Theorem 1, all irrelevant lagged differences will be classified as such with probability tending to one.

Theorems 1 and 2 show that the Adaptive LASSO can perform oracle efficient variable selection and estimation in stationary and nonstationary autoregressions. The theorems also suggest that the Adaptive LASSO can be used to discriminate between stationary and nonstationary autoregressions, and so open the possibility of using the Adaptive LASSO for unit root testing. The practical performance will be investigated in Section 6.

4. Finite sample and local analysis

In this section we analyze the finite sample behavior of the Adaptive LASSO. This is most conveniently done in the setting of an AR(1), in order to keep the focus on the main points and the expressions simple. The expressions we obtain for the finite sample selection probabilities also allow us to describe the local behavior of the Adaptive LASSO easily. We give a complete description for all possible limiting values of the regularization parameter λ_T. Since γ₁ = γ₂ = 1 is in accordance with Theorems 1 and 2, i.e. we obtain the oracle property in the stationary as well as the nonstationary setting for this choice, we shall focus on this value in the sequel.

Similar calculations can be made for other admissible values of γ₁ and γ₂. Since there are no lagged differences, γ₂ is redundant in this setting. To be precise, we will consider the model

(3)    Δy_t = ρ y_{t−1} + ε_t,

where ε_t can be quite general. In particular, it just needs to allow for a central limit theorem to apply in the stationary case ρ ∈ (−2, 0) and a functional central limit theorem to apply in the unit root as well as the local-to-unity setting. Appropriate assumptions can be found in Phillips (1987a) and Phillips (1987b), who allow for quite general dependence structures in the sequence {ε_t}. In the following we shall assume that {ε_t} is i.i.d., but keep in mind that the results carry over to much more general assumptions on {ε_t}. ρ is estimated by minimizing

(4)    L(ρ) = Σ_{t=1}^T (Δy_t − ρ y_{t−1})² + 2 λ_T |ρ|/|ρ̂_I|,

which is the AR(1) equivalent of (2), except for a factor of 2 in front of λ_T whose only purpose is to make expressions simpler. Without any confusion, we let ρ̂ denote the minimizer of (4) and ρ̂_I the least squares estimator of ρ. The first theorem gives the exact finite sample probability of setting ρ̂ equal to zero.

Theorem 3. Let Δy_t = ρ y_{t−1} + ε_t and let ρ̂ denote the minimizer of (4). Then

P(ρ̂ = 0) = P( ρ² Σ_{t=1}^T y_{t−1}² + 2ρ Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ).

Theorem 3 characterizes the exact finite sample probability of setting ρ̂ to zero. It is sensible that this is an increasing function of the regularization/shrinkage parameter λ_T, since the larger λ_T is, the larger is the shrinkage. The following theorems quantify the asymptotic behavior of P(ρ̂ = 0) in case of (1) a unit root, (2) stationarity, and (3) local-to-unity behavior in (3). The first theorem is concerned with the nonstationary case where ρ = 0. Hence, we would like to classify ρ̂ = 0. This of course requires λ_T to be sufficiently large.

Theorem 4. Let Δy_t = ρ y_{t−1} + ε_t with ρ = 0 and let ρ̂ denote the minimizer of (4).

1. If λ_T → 0 then P(ρ̂ = 0) → 0.
2. If λ_T → λ ∈ (0, ∞), then P(ρ̂ = 0) → p ∈ (0, 1).
3. If λ_T → ∞ then P(ρ̂ = 0) → 1.

Theorem 4 reveals that in the presence of a unit root, λ_T → ∞ yields consistent model selection, i.e. P(ρ̂ = 0) → 1. If λ_T tends to a finite constant, ρ̂ has mass at 0 in the limit, but the mass is not one. If λ_T → 0, ρ̂ is asymptotically equivalent to the least squares estimator, and in fact even T ρ̂ is asymptotically equivalent to T times the least squares estimator. Since this does not have any mass at 0 in the limit, it is sensible that P(ρ̂ = 0) → 0 when λ_T → 0. In particular, ρ̂ equals the least squares estimator if λ_T = 0 for every T < ∞, which can be seen from (4). Since ρ = 0, Theorem 4 does not impose any restrictions on λ_T in order to obtain conservative model selection, since there are no relevant variables to be excluded.
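The exact selection probability in Theorem 3 is easy to evaluate by simulation. The sketch below does this for the unit root case of Theorem 4; the function name, the choice of standard normal errors and the grid of λ_T values are illustrative assumptions made here rather than specifications from the paper.

import numpy as np

rng = np.random.default_rng(0)

def prob_rho_zero(T, rho, lam, n_rep=2000):
    """Monte Carlo estimate of P(rho_hat = 0) for the AR(1) adaptive LASSO (4).

    Uses the exact condition from Theorem 3: rho_hat = 0 iff
    (sum_t y_{t-1} * dy_t)^2 / sum_t y_{t-1}^2 <= lam, i.e. rho_I^2 * sum_t y_{t-1}^2 <= lam.
    """
    hits = 0
    for _ in range(n_rep):
        eps = rng.standard_normal(T)
        y = np.zeros(T + 1)
        for t in range(1, T + 1):              # dy_t = rho * y_{t-1} + eps_t
            y[t] = y[t - 1] + rho * y[t - 1] + eps[t - 1]
        ylag, dy = y[:-1], np.diff(y)
        stat = (ylag @ dy) ** 2 / (ylag @ ylag)
        hits += stat <= lam
    return hits / n_rep

# Unit root (rho = 0): P(rho_hat = 0) approaches 1 only if lambda_T grows (Theorem 4).
T = 500
for lam in (0.5, 5.0, T ** 0.25, T ** 0.45):
    print(f"lam = {lam:8.2f}:  P(rho_hat = 0) ~ {prob_rho_zero(T, 0.0, lam):.2f}")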

The next theorem concerns the stationary case. In this case we do not want ρ̂ to possess any mass at zero asymptotically. This naturally restricts the rate at which λ_T can increase, as seen below.

Theorem 5. Let Δy_t = ρ y_{t−1} + ε_t with ρ ∈ (−2, 0) and let ρ̂ denote the minimizer of (4).

1. If λ_T/T → 0 then P(ρ̂ = 0) → 0.
2. If λ_T/T → λ ∈ (0, ∞) then P(ρ̂ = 0) → 0 if ρ²E(y_t²) > λ and P(ρ̂ = 0) → 1 if ρ²E(y_t²) < λ.
3. If λ_T/T → ∞ then P(ρ̂ = 0) → 1.

Part 1 of Theorem 5 shows that in order for ρ̂ not to possess any mass at 0 asymptotically, it is sufficient that λ_T ∈ o(T). If λ_T/T → ∞, then ρ̂ will be set to zero with probability tending to one even though ρ ≠ 0, as can be seen from part 3 of the theorem. Part 2 of Theorem 5 is markedly different from part 2 of Theorem 4. This is due to the fact that the random variable which decides whether ρ̂ is to be classified as zero or not converges to a point mass at ρ²E(y_t²) in the stationary setting, while it converges to a nondegenerate distribution in the nonstationary setting. This constant is the knife edge at which P(ρ̂ = 0) switches between 0 and 1. See the appendix for details. Also notice that no classification is possible when λ = ρ²E(y_t²) in the stationary setting, since ρ²E(y_t²) is a discontinuity point of the limiting distribution of the variable that decides whether ρ̂ is to be classified as zero or not. We suspect that P(ρ̂ = 0) then depends not only on λ = ρ²E(y_t²) but on the concrete fashion in which λ_T/T converges to ρ²E(y_t²).

Taken together, Theorems 4 and 5 show that for the Adaptive LASSO to act as a consistent model selection procedure it is sufficient that λ_T → ∞ (by Theorem 4) and λ_T/T → λ for some λ < ρ²E(y_t²) (by Theorem 5). Since ρ ≠ 0 in Theorem 5, λ = 0 works in particular. Hence, λ_T = T^α is admissible for all 0 < α < 1. In order to make the Adaptive LASSO work as a conservative model selection device, Theorem 4 does not pose any restrictions on λ_T, since ρ = 0 in that theorem so there are no relevant variables to be excluded. Hence, the only requirement is λ_T/T → λ for some λ < ρ²E(y_t²) (by Theorem 5). Note that in particular λ_T → λ ∈ [0, ∞) or λ_T = 0 for all T work in this setting. λ_T = 0 amounts to no shrinkage at all and hence a zero probability of excluding relevant variables. These two limiting values of λ_T are ruled out by consistent model selection and will play a crucial role in highlighting the difference between consistent and conservative model selection in Theorem 6 below.

Next, compare the requirements on λ_T from Theorems 4 and 5 for consistent model selection (λ_T → ∞ and λ_T/T → λ < ρ²E(y_t²)) to those resulting from Theorems 1 and 2 with γ₁ = γ₂ = 1. Note that λ_T → ∞ in both groups of theorems. However, Theorems 1 and 2 are more restrictive on the growth rate of λ_T in that they require λ_T/√T → 0, while Theorems 4 and 5 only require λ_T/T → 0. This is not too surprising since Theorems 1 and 2 deliver more: they yield consistent model selection, consistent parameter estimation as well as the oracle efficient distribution. It is not hard to show that the requirements made in Theorems 4 and 5 are also sufficient to yield consistent parameter estimation. However, only requiring λ_T/T → 0 is not enough to ensure that √T ρ̂ is asymptotically equivalent to √T times the least squares estimator when ρ ≠ 0, since we penalize, and hence shrink, too much. The result is that ρ̂ no longer obtains the oracle efficient distribution asymptotically. λ_T/√T → 0 is needed to obtain the oracle efficient distribution. Knight and Fu (2000) made a similar observation in a deterministic cross-sectional setting.

The next theorem concerns the local-to-unity situation. So far all results have been for pointwise asymptotics. The local-to-unity setting is a harder test for an estimator of ρ in the sense that it

must perform well on a sequence of shrinking alternatives instead of only at a single point in the parameter space.

Theorem 6. Let Δy_t = ρ y_{t−1} + ε_t with ρ = c/T for some c ≠ 0 and let ρ̂ denote the minimizer of (4).

1. If λ_T → 0 then P(ρ̂ = 0) → 0.
2. If λ_T → λ ∈ (0, ∞), then P(ρ̂ = 0) → p ∈ (0, 1).
3. If λ_T → ∞ then P(ρ̂ = 0) → 1.

Consistent model selection requires that λ_T → ∞ by Theorem 4. By part 3 of Theorem 6 this implies that P(ρ̂ = 0) → 1. Hence, the Adaptive LASSO tuned to perform consistent model selection has no power against deviations from 0 of the form c/T. This is a negative result since c/T ≠ 0 for all T. This poor local performance is the flip side of shrinkage estimators tuned to perform consistent model selection and is reminiscent of Hodges' estimator, see e.g. Lehmann and Casella (1998). This phenomenon has already been observed by Leeb and Pötscher (2005, 2008) and Pötscher and Leeb (2009) in a different context. Recall, however, that λ_T → ∞ is required in Theorems 1, 2 and 4 in order to achieve enough shrinkage to obtain consistent model selection. The price paid for a shrinkage of this size is that even parameters that are local to zero at a rate of O(1/T) will be shrunk to zero with probability tending to one. The Adaptive LASSO tuned to perform consistent model selection has no power against such alternatives.

On the other hand, the Adaptive LASSO tuned to perform conservative model selection does have power against deviations from 0 of this form. This becomes clear from parts 1 and 2 of Theorem 6, since λ_T → 0 and λ_T → λ ∈ (0, ∞) are both in line with conservative model selection (see the discussion between Theorems 5 and 6). If λ_T → λ ∈ [0, ∞) then the probability of setting ρ̂ exactly equal to zero no longer tends to one for ρ = c/T. However, by Theorem 4, the same is the case when ρ = 0. If this tradeoff is preferred to consistent model selection, the next section gives the properties of the Adaptive LASSO in the AR(p) model when tuned to perform conservative model selection.

5. Conservative model selection

We continue to consider the case γ₁ = γ₂ = 1 since this keeps expressions simple. When tuned to perform conservative model selection the properties of the Adaptive LASSO are as follows.

Theorem 7 (Conservative model selection under nonstationarity). Assume that ρ = 0 and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Let γ₁ = γ₂ = 1 (this assumption is not essential at all; it is only made to ensure that a single penalty sequence λ_T appears, so that we do not have to deal with different cases for λ_T^{γ₁} and λ_T^{γ₂}). Then, if λ_T → λ ∈ [0, ∞),

S_T [(ρ̂, β̂')' − (0, β')'] ⇒ arg min Ψ(u), which implies ||S_T [(ρ̂, β̂')' − (0, β')']||_{ℓ₂} ∈ O_p(1), where

Ψ(u) = u'Au − 2u'B + λ |u₁|/|C₁| + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0},

with

A = diag( σ²(1 − Σ_{j=1}^p β_j)^{−2} ∫₀¹ W(r)² dr , Σ ),      B = ( σ²(1 − Σ_{j=1}^p β_j)^{−1} ∫₀¹ W(r) dW(r) , Z' )',

C₁ = (1 − Σ_{j=1}^p β_j) ∫₀¹ W(s) dW(s) / ∫₀¹ W(s)² ds      and      C_j ∼ N(0, σ²[Σ^{−1}]_{j,j}).

Theorem 7 reveals that (ρ̂, β̂) converges at the same rate as the least squares estimator, but to the minimizer of Ψ(u). Note that no shrinkage is applied to u_{2,j} for j ∈ A, which is a desirable property. For λ = 0, Theorem 7 reveals that the asymptotic distribution of the Adaptive LASSO estimator is identical to that of the minimizer of u'Au − 2u'B. This in turn reveals that in this case the limit law of the Adaptive LASSO estimator is identical to the one of the least squares estimator in the model including all variables. This result is of course not surprising, since λ = 0 implies that asymptotically there is no penalty on nonzero parameters and hence no shrinkage, which implies that the objective function of the Adaptive LASSO, (2), approaches the least squares objective function. The absence of shrinkage also implies that no parameters will be set exactly equal to 0; or, more precisely, the probability of a parameter being set to 0 is 0. If λ ∈ (0, ∞), the penalty terms no longer vanish asymptotically, except for those on the nonzero β_j. Hence, with positive probability the minimizer of Ψ(u) has entries with value zero (actually calculating this probability seems to be non-trivial).

Next, consider conservative model selection in the stationary case.

Theorem 8 (Conservative model selection under stationarity). Assume that ρ ∈ (−2, 0) and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Let γ₁ = γ₂ = 1 (as in Theorem 7, this assumption is not essential). Then, if λ_T → λ ∈ [0, ∞),

√T [(ρ̂, β̂')' − (ρ, β')'] ⇒ arg min Ψ̃(u), which implies √T ||[(ρ̂, β̂')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1), where

Ψ̃(u) = u'Qu − 2u'B + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0}

with B ∼ N_{p+1}(0, σ²Q) and C_j ∼ N(0, σ²[Q^{−1}]_{1+j,1+j}).

As in the nonstationary case, (ρ̂, β̂) converges at the same rate as the least squares estimator, but to the minimizer of Ψ̃(u). Note that no shrinkage is applied to u_{2,j} for j ∈ A. More importantly, no shrinkage is applied to u₁ since now ρ ≠ 0. Similarly to the nonstationary case, (ρ̂, β̂) converges to the same limit as the least squares estimator if λ = 0. A particular instance of this is of course λ_T = 0 for all T, in which case the Adaptive LASSO estimate is equal to the least squares estimate, so their limiting laws are a fortiori identical.

6. Monte Carlo

This section illustrates the above results by means of Monte Carlo experiments. The Adaptive LASSO is implemented by means of the algorithm proposed in Zou (2006). Its performance is compared to the LASSO implemented by the LARS algorithm of Efron et al. (2004) using the publicly available package at cran.r-project.org. Furthermore, a comparison is made to the BIC only selecting over the lagged differences, i.e. y_{t−1} is always included in the model. Using the model chosen by BIC, an Augmented Dickey-Fuller test is carried out for the presence of a unit root at a 5% significance level. The results for this procedure are denoted BICDF. Finally, all these procedures are compared to the OLS Oracle (OLSO), which carries out least squares only including the relevant variables. The Adaptive LASSO is implemented with γ₁ = γ₂ = γ ∈ {0.5, 1, 10}. γ = 0.5 is included since it is at the lowest end of the values of γ which are in accordance with Theorems 1 and 2. We also experimented with values of γ larger than 10, but the performance of the Adaptive LASSO was not improved by these. Finally, the Adaptive LASSO was also implemented by selecting γ by BIC from the above values.

The above procedures are compared along the following dimensions.

1. Sparsity pattern: How often does the procedure detect the correct sparsity pattern, i.e. how often does it include all relevant variables while not including any irrelevant ones?
2. Unit root: How often does the procedure make the correct decision on inclusion/exclusion of y_{t−1}? Or put differently, how well do the procedures classify whether ρ = 0 or not?
3. Relevant included: How often does the procedure include all relevant variables in the model? Even though the correct sparsity pattern is not detected, it is of interest to know whether the procedure at least does not exclude any relevant variables from the model; or, in the jargon of the previous sections, whether the procedure at least performs conservative model selection.
4. Loss: How accurately does the estimated model predict on a hold-out sample? Here we generate data from the same data generating process as used for the specification and estimation and use the estimated parameters to make predictions on this hold-out sample.

To gauge the performance along the above dimensions we carry out the following experiments with sample sizes of T = 100 and 1000. The number of Monte Carlo replications is 1000 in all cases.

Experiment A: ρ = 0 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). A unit root setting with three relevant lagged differences.

Experiment B: ρ = −0.05 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). A stationary, but close to unit root, setting with three relevant lagged differences. This should be a challenging setting since the data generating process is stationary but still close to the unit root setting, which makes it harder to classify ρ ≠ 0.

Experiment C: ρ = 0 with twelve lagged differences: the three non-zero coefficients from Experiment A at lags 1, 2 and 3, two additional non-zero coefficients at lags 7 and 10, and all other coefficients equal to zero.
This experiment is carried out to investigate how well the methods fare when there is a gap in the lag structure.
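For concreteness, the following is a minimal sketch of how data from a design such as Experiment A can be generated; the standard normal error distribution, the burn-in length and the function names are assumptions made here for illustration, not specifications from the paper.

import numpy as np

def simulate_adf(T, rho, beta, rng, burn=200):
    """Simulate Delta y_t = rho*y_{t-1} + sum_j beta_j*Delta y_{t-j} + eps_t with eps_t ~ N(0, 1).

    Returns the last T + p + 1 levels of y, enough to run the regression with p lagged differences.
    """
    p = len(beta)
    n = burn + T + p + 1
    y = np.zeros(n)
    dy = np.zeros(n)
    for t in range(p + 1, n):
        dy[t] = rho * y[t - 1] + sum(beta[j] * dy[t - 1 - j] for j in range(p)) + rng.standard_normal()
        y[t] = y[t - 1] + dy[t]
    return y[-(T + p + 1):]

rng = np.random.default_rng(1)
beta_A = np.array([0.4, 0.3, 0.2] + [0.0] * 7)   # Experiment A design as described above
y_est  = simulate_adf(T=100, rho=0.0, beta=beta_A, rng=rng)   # estimation sample
y_hold = simulate_adf(T=100, rho=0.0, beta=beta_A, rng=rng)   # independent hold-out sample

In each replication an estimation sample and an independent hold-out sample of this kind are drawn, the latter being used to compute the Loss criterion.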

6.1. Results. The best performer in terms of choosing the correct sparsity pattern in Experiment A when T = 100 is the Adaptive LASSO with γ = 10 (see Table 1). It is also superior when it comes to correctly classifying ρ = 0 versus ρ ≠ 0: it always makes the correct classification. This is better than the BICDF, which classifies ρ correctly in 90% of the instances. The parameter estimates of the Adaptive LASSO using γ = 10 also yield the best out-of-sample predictive accuracy (lowest Loss), outperforming all other procedures except for the infeasible OLS Oracle. It is seen that classifying ρ correctly plays an important role in obtaining a low Loss, since procedures with low success in classifying ρ incur the biggest losses (BIC and LASSO) and vice versa. For the Adaptive LASSO no choice of γ retains all the relevant variables more than 50% of the time, and interestingly γ = 10 is the worst choice along this dimension. Among all procedures, the LASSO does by far the best job at retaining relevant variables when T = 100. For T = 1000 the Adaptive LASSO performs as well as the OLS Oracle along all dimensions, underscoring its oracle property. It chooses the correct sparsity pattern almost every time (except when γ = 0.5) and always classifies ρ correctly. Experiment A also underscores that the LASSO is not a consistent variable selection procedure: as the sample size increases, the fraction of correctly selected sparsity patterns remains constant.

Table 1. Experiment A: ρ = 0 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). Rows: Sparsity Pattern, Unit Root, Relevant Retained, Loss; columns: Adaptive LASSO (γ = 0.5, γ = 1, γ = 10, γ selected by BIC), BIC, BICDF, LASSO, OLS Oracle (OLSO); reported for T = 100 and T = 1000.

The first thing one notices in Experiment B (Table 2) is that now γ = 0.5 is the preferred value of γ for the Adaptive LASSO. This is in opposition to the nonstationary setting in Experiment A, where γ = 10 yielded the best performance. It is also interesting that the LASSO is actually the strongest performer when T = 100: it selects the correct sparsity pattern most often, classifies ρ correctly, and retains all relevant variables in about 80% of the cases. However, its lack of variable selection consistency is underscored by the fact that the fraction of times it selects the correct sparsity pattern does not tend to one even as the sample size increases (the estimations were also carried out for a larger sample size, not reported here, and even then the LASSO only detected the correct sparsity pattern in 47% of the instances). In this stationary setting it seems less important to classify ρ correctly in order to achieve a low Loss on the hold-out sample, since all procedures incur roughly the same Loss. This is in opposition to Experiment A. For the Adaptive LASSO as well as the BIC, all quantities approach the ones of the OLS Oracle as the sample size increases, illustrating the oracle efficiency of these estimators. The two procedures perform about equally well in this stationary setting. The poor performance of the Adaptive LASSO for γ = 10 may seem to be in opposition to Theorem 2, but for the larger sample size (results not reported here) the Adaptive LASSO detects the correct sparsity pattern in more than 90% of the instances even with this value of γ.

Table 2. Experiment B: ρ = −0.05 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). Rows: Sparsity Pattern, Unit Root, Relevant Retained, Loss; columns: Adaptive LASSO (γ = 0.5, γ = 1, γ = 10, γ selected by BIC), BIC, BICDF, LASSO, OLSO; reported for T = 100 and T = 1000.

As in Experiment A, the data generating process possesses a unit root in Experiment C. Considering the results for T = 100 in Table 3, the findings from Experiment A are roughly confirmed. Choosing γ = 10 for the Adaptive LASSO seems to be a wise choice in the unit root setting, even with gaps in the lag structure. The Adaptive LASSO always classifies ρ correctly with this value of γ and outperforms its closest competitor, the Adaptive LASSO using BIC to choose γ. By considering the Loss on the hold-out sample it is also seen that a correct classification of ρ results in a big gain in predictive power in the presence of a unit root, confirming the finding in Experiment A. As the sample size increases, the Adaptive LASSO and the BIC perform better while the performance of the LASSO does not improve, underscoring the oracle property of the first two procedures and the lack of the same for the LASSO. Note that the BICDF cannot be expected to classify ρ correctly in more than 95% of the cases in the presence of a unit root, since testing is carried out at a 5% significance level. On the other hand, the correct classification probability of the Adaptive LASSO tends to one. Furthermore, the BIC is considerably slower to implement since the number of regressions to be run increases exponentially in the number of potential explanatory variables.

Table 3. Experiment C: ρ = 0 with a gap in the lag structure (twelve lagged differences, non-zero coefficients at lags 1, 2, 3, 7 and 10). Rows: Sparsity Pattern, Unit Root, Relevant Retained, Loss; columns: Adaptive LASSO (γ = 0.5, γ = 1, γ = 10, γ selected by BIC), BIC, BICDF, LASSO, OLSO; reported for T = 100 and T = 1000.

7. Conclusion

This paper has shown that the Adaptive LASSO can be tuned to perform consistent model selection in stationary and nonstationary autoregressions. The estimator of the parameters converges at the oracle efficient rate, i.e. as fast as if an oracle had revealed the true model prior to estimation and only the relevant variables had been included in a least squares estimation. This enables us to use the Adaptive LASSO to distinguish between stationary and nonstationary autoregressions. However, the Adaptive LASSO has no power against alternatives in a shrinking neighborhood around 0 when tuned to perform consistent variable selection. This problem can be alleviated by tuning the Adaptive LASSO to perform conservative model selection. The price paid compared to consistent model selection is that truly zero parameters are no longer set to zero with probability tending to one, but still with positive probability.

Monte Carlo experiments confirm that the Adaptive LASSO performs well compared to standard competitors. This is the case in particular for nonstationary data.

8. Appendix

Proof of Theorem 1. For the proof of this theorem we will need the following results, which can be found in e.g. Hamilton (1994):

(5)    S_T^{−1} X'X S_T^{−1} ⇒ diag( σ²(1 − Σ_{j=1}^p β_j)^{−2} ∫₀¹ W(r)² dr , Σ ) := A,

(6)    S_T^{−1} X'ε ⇒ ( σ²(1 − Σ_{j=1}^p β_j)^{−1} ∫₀¹ W(r) dW(r) , Z' )' := B.

We shall also make use of the fact that the least squares estimator (ρ̂_I, β̂_I') of (ρ, β') in (1) satisfies ||S_T[(ρ̂_I, β̂_I')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1).

First, let u = (u₁, u₂')', where u₁ is a scalar and u₂ a p × 1 vector. Set ρ = u₁/T and β_j = β_j + u_{2,j}/√T, which implies that (2), as a function of u, can be written as

Ψ_T(u) = ||Δy − (u₁/T) y_{−1} − Σ_{j=1}^p (β_j + u_{2,j}/√T) Δy_{−j}||²_{ℓ₂} + λ_T w_1^{γ₁} |u₁|/T + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} |β_j + u_{2,j}/√T|.

Let û = (û₁, û₂')' = arg min Ψ_T(u) and notice that û₁ = T ρ̂ and û_{2,j} = √T(β̂_j − β_j) for j = 1, ..., p. Define

V_T(u) = Ψ_T(u) − Ψ_T(0) = u' S_T^{−1} X'X S_T^{−1} u − 2u' S_T^{−1} X'ε + λ_T w_1^{γ₁} |u₁|/T + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ).

Consider the first two terms in the above display. It follows from (5) and (6) that

(7)    u' S_T^{−1} X'X S_T^{−1} u − 2u' S_T^{−1} X'ε ⇒ u'Au − 2u'B.

Furthermore,

(8)    λ_T w_1^{γ₁} |u₁|/T = λ_T |u₁| / (T |ρ̂_I|^{γ₁}) = (λ_T T^{γ₁−1}) |u₁| / |T ρ̂_I|^{γ₁} → { ∞ in probability if u₁ ≠ 0; 0 in probability if u₁ = 0 }

since T ρ̂_I is tight. Also, if β_j ≠ 0,

(9)    λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = (λ_T/√T) (1/|β̂_{I,j}|^{γ₂}) √T( |β_j + u_{2,j}/√T| − |β_j| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|β̂_{I,j}|^{γ₂} → 1/|β_j|^{γ₂} < ∞ in probability, and (iii) √T( |β_j + u_{2,j}/√T| − |β_j| ) → u_{2,j} sign(β_j). Finally, if β_j = 0,

(10)   λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / (√T |β̂_{I,j}|^{γ₂}) = (λ_T T^{(γ₂−1)/2}) |u_{2,j}| / |√T β̂_{I,j}|^{γ₂} → { ∞ in probability if u_{2,j} ≠ 0; 0 in probability if u_{2,j} = 0 }

since (i) λ_T T^{(γ₂−1)/2} → ∞ and (ii) √T β̂_{I,j} is tight. Putting together (7)–(10) one concludes

V_T(u) ⇒ Ψ(u) = { u'Au − 2u'B if u₁ = 0 and u_{2,j} = 0 for all j ∈ A^c; ∞ if u₁ ≠ 0 or u_{2,j} ≠ 0 for some j ∈ A^c }.

Since V_T(u) is convex and Ψ(u) has a unique minimum, it follows from Knight (1999) that arg min V_T(u) ⇒ arg min Ψ(u). Hence,

(11)   û₁ ⇒ δ₀,
(12)   û_{2,A^c} ⇒ δ₀^{|A^c|},
(13)   û_{2,A} ⇒ N(0, σ²[Σ_A]^{−1}),

where δ₀ is the Dirac measure at 0 and |A^c| is the cardinality of A^c (hence, δ₀^{|A^c|} is the |A^c|-dimensional Dirac measure at 0). Notice that (11) and (12) imply that û₁ → 0 in probability and û_{2,A^c} → 0 in probability. An equivalent formulation of (11)–(13) is

(14)   T ρ̂ ⇒ δ₀,
(15)   √T (β̂_{A^c} − β_{A^c}) ⇒ δ₀^{|A^c|},
(16)   √T (β̂_A − β_A) ⇒ N(0, σ²[Σ_A]^{−1}).

(14)–(16) yield the consistency part of the theorem at the rate of T for ρ̂ and √T for β̂. Notice that this also implies that no β̂_j, j ∈ A, will be set equal to 0, since for all j ∈ A, β̂_j converges in probability to β_j ≠ 0. (16) also yields the oracle efficient asymptotic distribution, i.e. part 3 of the theorem.

It remains to show part 2 of the theorem: P(ρ̂ = 0) → 1 and P(β̂_{A^c} = 0) → 1. Both proofs are by contradiction. First, assume ρ̂ ≠ 0. Then the first order conditions for a minimum read

−2 y_{−1}'(Δy − X(ρ̂, β̂')') + λ_T w_1^{γ₁} sign(ρ̂) = 0,

which is equivalent to

−2 (1/T) y_{−1}'(Δy − X(ρ̂, β̂')') + (λ_T/T) w_1^{γ₁} sign(ρ̂) = 0.

Consider first the second term:

| (λ_T/T) w_1^{γ₁} sign(ρ̂) | = (λ_T T^{γ₁−1}) / |T ρ̂_I|^{γ₁} → ∞ in probability

since T ρ̂_I is tight. For the first term one has

(1/T) y_{−1}'(Δy − X(ρ̂, β̂')') = (1/T) y_{−1}'ε − (1/T) y_{−1}'X S_T^{−1} S_T[(ρ̂, β̂')' − (0, β')'].

By (6), (1/T) y_{−1}'ε ⇒ σ²(1 − Σ_{j=1}^p β_j)^{−1} ∫₀¹ W(r) dW(r), and (1/T) y_{−1}'X S_T^{−1} ⇒ ( σ²(1 − Σ_{j=1}^p β_j)^{−2} ∫₀¹ W(r)² dr, 0, ..., 0 ) by (5). Hence, (1/T) y_{−1}'ε and (1/T) y_{−1}'X S_T^{−1} are tight. We also know that S_T[(ρ̂, β̂')' − (0, β')'] converges weakly by (14)–(16), which implies it is tight as well. Taken together, (1/T) y_{−1}'(Δy − X(ρ̂, β̂')') is tight and so

P(ρ̂ ≠ 0) ≤ P( −2 (1/T) y_{−1}'(Δy − X(ρ̂, β̂')') + (λ_T/T) w_1^{γ₁} sign(ρ̂) = 0 ) → 0.

Next, assume β̂_j ≠ 0 for j ∈ A^c. From the first order conditions,

−2 Δy_{−j}'(Δy − X(ρ̂, β̂')') + λ_T w_{2,j}^{γ₂} sign(β̂_j) = 0,

or equivalently,

−2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0.

First, consider the second term:

| (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) | = (λ_T T^{(γ₂−1)/2}) / |√T β̂_{I,j}|^{γ₂} → ∞ in probability

since √T β̂_{I,j} is tight. Regarding the first term,

(1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') = (1/√T) Δy_{−j}'ε − (1/√T) Δy_{−j}'X S_T^{−1} S_T[(ρ̂, β̂')' − (0, β')'].

By (6), (1/√T) Δy_{−j}'ε ⇒ N(0, σ²Σ_{j,j}), where in accordance with previous notation Σ_{j,j} is the jth diagonal element of Σ, and (1/√T) Δy_{−j}'X S_T^{−1} ⇒ (0, Σ_{j,1}, ..., Σ_{j,p}) by (5). Hence, (1/√T) Δy_{−j}'ε and (1/√T) Δy_{−j}'X S_T^{−1} are tight. The same is the case for S_T[(ρ̂, β̂')' − (0, β')'] since it converges weakly by (14)–(16). Taken together, (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') is tight and so

P(β̂_j ≠ 0) ≤ P( −2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0 ) → 0.

Proof of Theorem 2. The proof runs along the same lines as the proof of Theorem 1. For the proof we will need (17) and (18) below, which can be found in e.g. Hamilton (1994, Chapter 8). Notice that by definition of x_t = (y_{t−1}, z_t')', the lower right hand p × p block of Q is Σ. We shall make use of the following limit results:

(17)   (1/T) X'X →_p Q,

(18)   (1/√T) X'ε ⇒ N_{p+1}(0, σ²Q) := B,

where the definition of B means that B is a random vector distributed as N_{p+1}(0, σ²Q). We shall also make use of the fact that the least squares estimator is consistent under stationarity, i.e. √T ||[(ρ̂_I, β̂_I')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1).

First, let u = (u₁, u₂')', where u₁ is a scalar and u₂ a p × 1 vector. Set ρ = ρ + u₁/√T and β_j = β_j + u_{2,j}/√T, and write

Ψ̃_T(u) = ||Δy − (ρ + u₁/√T) y_{−1} − Σ_{j=1}^p (β_j + u_{2,j}/√T) Δy_{−j}||²_{ℓ₂} + λ_T w_1^{γ₁} |ρ + u₁/√T| + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} |β_j + u_{2,j}/√T|.

Let û = (û₁, û₂')' = arg min Ψ̃_T(u) and notice that û₁ = √T(ρ̂ − ρ) and û_{2,j} = √T(β̂_j − β_j) for j = 1, ..., p. Define

(19)   Ṽ_T(u) = Ψ̃_T(u) − Ψ̃_T(0) = u' (1/T) X'X u − 2u' (1/√T) X'ε + λ_T w_1^{γ₁} ( |ρ + u₁/√T| − |ρ| ) + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ).

Consider the first two terms in the above display. It follows from (17) and (18) that

(20)   u' (1/T) X'X u − 2u' (1/√T) X'ε ⇒ u'Qu − 2u'B.

Furthermore, since ρ ≠ 0,

(21)   λ_T w_1^{γ₁} ( |ρ + u₁/√T| − |ρ| ) = (λ_T/√T) (1/|ρ̂_I|^{γ₁}) √T( |ρ + u₁/√T| − |ρ| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|ρ̂_I|^{γ₁} → 1/|ρ|^{γ₁} < ∞ in probability, and (iii) √T( |ρ + u₁/√T| − |ρ| ) → u₁ sign(ρ).

Similarly, if β_j ≠ 0,

(22)   λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = (λ_T/√T) (1/|β̂_{I,j}|^{γ₂}) √T( |β_j + u_{2,j}/√T| − |β_j| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|β̂_{I,j}|^{γ₂} → 1/|β_j|^{γ₂} < ∞ in probability, and (iii) √T( |β_j + u_{2,j}/√T| − |β_j| ) → u_{2,j} sign(β_j). Finally, if β_j = 0,

λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / (√T |β̂_{I,j}|^{γ₂}) = (λ_T T^{(γ₂−1)/2}) |u_{2,j}| / |√T β̂_{I,j}|^{γ₂} → { ∞ in probability if u_{2,j} ≠ 0; 0 in probability if u_{2,j} = 0 }

since (i) λ_T T^{(γ₂−1)/2} → ∞ and (ii) √T β̂_{I,j} is tight. Putting (19)–(22) together one concludes

Ṽ_T(u) ⇒ Ψ(u) = { u'Qu − 2u'B if u_{2,j} = 0 for all j ∈ A^c; ∞ if u_{2,j} ≠ 0 for some j ∈ A^c }.

Since Ṽ_T(u) is convex and Ψ(u) has a unique minimum, it follows from Knight (1999) that arg min Ṽ_T(u) ⇒ arg min Ψ(u). Hence,

(23)   (û₁, û_{2,A}')' ⇒ N(0, σ²[Q_{(1,A+1)}]^{−1}),
(24)   û_{2,A^c} ⇒ δ₀^{|A^c|},

where δ₀ is the Dirac measure at 0 and |A^c| is the cardinality of A^c (hence, δ₀^{|A^c|} is the |A^c|-dimensional Dirac measure at 0). Notice that (24) implies that û_{2,A^c} → 0 in probability. An equivalent formulation of (23) and (24) is

(25)   √T [(ρ̂ − ρ), (β̂_A − β_A)']' ⇒ N(0, σ²[Q_{(1,A+1)}]^{−1}),
(26)   √T (β̂_{A^c} − β_{A^c}) ⇒ δ₀^{|A^c|}.

(25) and (26) establish the consistency part of the theorem at the oracle rate of √T. Note that this also implies that for no j ∈ A will β̂_j be set equal to 0, since for each j ∈ A, β̂_j converges in probability to β_j ≠ 0. The same is true for ρ̂. (25) also yields the oracle efficient asymptotic distribution, i.e. part 3 of the theorem. It remains to show part 2 of the theorem: P(β̂_{A^c} = 0) → 1. The proof is by contradiction.

Assume β̂_j ≠ 0 for j ∈ A^c. From the first order conditions,

−2 Δy_{−j}'(Δy − X(ρ̂, β̂')') + λ_T w_{2,j}^{γ₂} sign(β̂_j) = 0,

or equivalently,

−2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0.

First, consider the second term:

| (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) | = (λ_T T^{(γ₂−1)/2}) / |√T β̂_{I,j}|^{γ₂} → ∞ in probability

since √T β̂_{I,j} is tight. Regarding the first term,

(1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') = (1/√T) Δy_{−j}'ε − (1/T) Δy_{−j}'X √T[(ρ̂ − ρ), (β̂ − β)']'.

By (18), (1/√T) Δy_{−j}'ε ⇒ N(0, σ²Q_{j+1,j+1}), where in accordance with previous notation Q_{j+1,j+1} is the (j+1)th diagonal element of Q, and (1/T) Δy_{−j}'X →_p (Q_{j+1,1}, ..., Q_{j+1,p+1}) by (17). Hence, (1/√T) Δy_{−j}'ε and (1/T) Δy_{−j}'X are tight. The same is the case for √T[(ρ̂ − ρ), (β̂ − β)']' since it converges weakly by (25)–(26). Hence,

P(β̂_j ≠ 0) ≤ P( −2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0 ) → 0.

Before proving Theorem 3 we prove the following lemma. Let (x)₊ = max(x, 0).

Lemma 1. Let g : R → R be given by g(u) = u² − 2au + λ|u|, λ ≥ 0, a ≠ 0. Then arg min g = 0 if and only if λ ≥ 2|a|. More precisely, arg min g = sign(a)(|a| − λ/2)₊.

Proof. Assume a > 0. Since g'(u) = 2u − 2a + λ sign(u) is strictly negative for u < 0, arg min g ∈ [0, ∞). ũ > 0 is a local minimum, and hence a global minimum since g is strictly convex, if and only if it is a stationary point, i.e. g'(ũ) = 2ũ − 2a + λ = 0, which is equivalent to 0 < ũ = a − λ/2, i.e. λ < 2a. This contradicts λ ≥ 2a, and so arg min g = 0 when λ ≥ 2|a|. In total, the above shows that arg min g = (a − λ/2)₊ = sign(a)(|a| − λ/2)₊ for a > 0. Similar arguments establish the result for a < 0.
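Lemma 1 is the familiar soft-thresholding rule, and it can be checked numerically; the following small sketch is illustrative only, with the function name and the grid chosen here for convenience.

import numpy as np

def soft_threshold_argmin(a, lam):
    """Closed-form minimizer of g(u) = u^2 - 2*a*u + lam*|u| from Lemma 1:
    arg min g = sign(a) * max(|a| - lam/2, 0)."""
    return np.sign(a) * max(abs(a) - lam / 2.0, 0.0)

# Quick numerical comparison against a brute-force grid search.
a, lam = 0.7, 1.0
grid = np.linspace(-3, 3, 200001)
g = grid ** 2 - 2 * a * grid + lam * np.abs(grid)
print(grid[np.argmin(g)], soft_threshold_argmin(a, lam))   # both approximately 0.2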

Proof of Theorem 3. ρ̂ minimizes

L(ρ) = Σ_{t=1}^T (Δy_t − ρ y_{t−1})² + 2λ_T |ρ|/|ρ̂_I|,

which is equivalent to minimizing

ρ² − 2ρ ( Σ_{t=1}^T y_{t−1} Δy_t / Σ_{t=1}^T y_{t−1}² ) + ( 2λ_T / (|ρ̂_I| Σ_{t=1}^T y_{t−1}²) ) |ρ| = ρ² − 2ρ ρ̂_I + ( 2λ_T / (|ρ̂_I| Σ_{t=1}^T y_{t−1}²) ) |ρ|.

It follows from Lemma 1 that ρ̂ = 0 if and only if

2λ_T / (|ρ̂_I| Σ_{t=1}^T y_{t−1}²) ≥ 2|ρ̂_I|,   i.e.   ρ̂_I² Σ_{t=1}^T y_{t−1}² ≤ λ_T.

Hence, recalling that ρ̂_I = Σ_{t=1}^T y_{t−1} Δy_t / Σ_{t=1}^T y_{t−1}² is the least squares estimator,

P(ρ̂ = 0) = P( ρ̂_I² Σ_{t=1}^T y_{t−1}² ≤ λ_T )
= P( [Σ_{t=1}^T y_{t−1} Δy_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T )
= P( [ρ Σ_{t=1}^T y_{t−1}² + Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T )
= P( ρ² Σ_{t=1}^T y_{t−1}² + 2ρ Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ).

Proof of Theorem 4. From Phillips (1987a) one has

(27)   ( (1/T) Σ_{t=1}^T y_{t−1} ε_t , (1/T²) Σ_{t=1}^T y_{t−1}² ) ⇒ ( σ² ∫₀¹ W(s) dW(s) , σ² ∫₀¹ W(s)² ds ).

Using Theorem 3 with ρ = 0 yields

P(ρ̂ = 0) = P( [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ).

From (27) and the continuous mapping theorem it follows that

(28)   G_T := [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² = [ (1/T) Σ_{t=1}^T y_{t−1} ε_t ]² / ( (1/T²) Σ_{t=1}^T y_{t−1}² ) ⇒ σ² [ ∫₀¹ W(s) dW(s) ]² / ∫₀¹ W(s)² ds := G,

where the last definition means that G is a random variable distributed as σ² [∫₀¹ W(s) dW(s)]² / ∫₀¹ W(s)² ds.

Case 1: λ_T → 0. Since the right hand side of (28) is absolutely continuous with respect to the Lebesgue measure, it has no mass points, and so P(ρ̂ = 0) = P(G_T ≤ λ_T) → F_G(0) = 0 if λ_T → 0.

Case 2: λ_T → λ ∈ (0, ∞). By the same reasoning as in Case 1 it follows that P(ρ̂ = 0) = P(G_T ≤ λ_T) → F_G(λ) =: p ∈ (0, 1), since G is supported on all of R₊.

Case 3: λ_T → ∞. Since G_T converges weakly it is tight and the result follows.

Proof of Theorem 5. By standard results, see e.g. Hamilton (1994),

(1/√T) Σ_{t=1}^T y_{t−1} ε_t ⇒ N(0, σ²E[y_t²])   and   (1/T) Σ_{t=1}^T y_{t−1}² →_p E[y_t²].

Using Theorem 3 with ρ ∈ (−2, 0) yields

P(ρ̂ = 0) = P( (1/T)( ρ² Σ_{t=1}^T y_{t−1}² + 2ρ Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ) ≤ λ_T/T ) = P( H_T ≤ λ_T/T ),

where H_T := ρ² (1/T) Σ_{t=1}^T y_{t−1}² + (2ρ/√T) (1/√T) Σ_{t=1}^T y_{t−1} ε_t + (1/T) [ (1/√T) Σ_{t=1}^T y_{t−1} ε_t ]² / ( (1/T) Σ_{t=1}^T y_{t−1}² ). Since (1/√T) Σ_{t=1}^T y_{t−1} ε_t is tight and (1/T) Σ_{t=1}^T y_{t−1}² →_p E[y_t²], the last two terms of H_T converge to zero in probability, and hence

H_T →_p ρ² E[y_t²] := L.

Case 1: λ_T/T → 0. Since H_T converges in probability to L > 0 it follows that

P(ρ̂ = 0) = P(H_T ≤ λ_T/T) ≤ P( |H_T − L| ≥ L/2 ) → 0,

where the estimate holds for T sufficiently large since λ_T/T → 0.

Case 2: λ_T/T → λ ∈ (0, ∞).

P(ρ̂ = 0) = P(H_T ≤ λ_T/T) ≤ P( H_T ≤ λ + (L − λ)/2 ) ≤ P( |H_T − L| ≥ (L − λ)/4 ) → 0,   if λ < L,
P(ρ̂ = 0) = P(H_T ≤ λ_T/T) ≥ P( H_T ≤ λ − (λ − L)/2 ) ≥ P( |H_T − L| ≤ (λ − L)/4 ) → 1,   if λ > L,

where the first estimate in each of the two cases holds from a certain step T and onwards.

Case 3: λ_T/T → ∞. Since H_T converges in probability it is tight and the result follows.

Proof of Theorem 6. By Phillips (1987b),

(29)   ( (1/T) Σ_{t=1}^T y_{t−1} ε_t , (1/T²) Σ_{t=1}^T y_{t−1}² ) ⇒ ( σ² ∫₀¹ J_c(r) dW(r) , σ² ∫₀¹ J_c(r)² dr ),

where J_c is the Ornstein-Uhlenbeck process with parameter c. Notice how the only difference to (27) is that the integrand process now is J_c(r) instead of W(r). For c = 0 they are identical, as the Ornstein-Uhlenbeck process collapses to the Wiener process. From Theorem 3 with ρ = c/T it follows that

P(ρ̂ = 0) = P( (c/T)² Σ_{t=1}^T y_{t−1}² + 2(c/T) Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ) = P( K_T ≤ λ_T ),

where K_T := c² (1/T²) Σ_{t=1}^T y_{t−1}² + 2c (1/T) Σ_{t=1}^T y_{t−1} ε_t + [ (1/T) Σ_{t=1}^T y_{t−1} ε_t ]² / ( (1/T²) Σ_{t=1}^T y_{t−1}² ). By the continuous mapping theorem,

K_T ⇒ c² σ² ∫₀¹ J_c(r)² dr + 2c σ² ∫₀¹ J_c(r) dW(r) + σ² [ ∫₀¹ J_c(r) dW(r) ]² / ∫₀¹ J_c(r)² dr := K,

where the last definition means that K is a random variable distributed as the weak limit of K_T.

Case 1: λ_T → 0. Since K is absolutely continuous with respect to the Lebesgue measure, it has no mass points, and so P(ρ̂ = 0) = P(K_T ≤ λ_T) → F_K(0) = 0.

Case 2: λ_T → λ ∈ (0, ∞). By the same reasoning as in Case 1 it follows that P(ρ̂ = 0) = P(K_T ≤ λ_T) → F_K(λ) ∈ (0, 1), since K is supported on all of R₊.

Case 3: λ_T → ∞. Since K_T converges weakly it is tight and the result follows.

Proof of Theorem 7. The setting is the same as in the proof of Theorem 1. Follow the proof of that theorem, with identical notation, up to (7), with γ₁ = γ₂ = 1. Next, notice that

(30)   λ_T w_1 |u₁|/T = λ_T |u₁| / |T ρ̂_I| ⇒ λ |u₁|/|C₁|

by (5) and (6), since C₁ has no mass at 0. Furthermore, if β_j ≠ 0,

(31)   λ_T w_{2,j} ( |β_j + u_{2,j}/√T| − |β_j| ) = (λ_T/√T) (1/|β̂_{I,j}|) √T( |β_j + u_{2,j}/√T| − |β_j| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|β̂_{I,j}| → 1/|β_j| < ∞ in probability, and (iii) √T( |β_j + u_{2,j}/√T| − |β_j| ) → u_{2,j} sign(β_j). Finally, if β_j = 0,

(32)   λ_T w_{2,j} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / |√T β̂_{I,j}| ⇒ λ |u_{2,j}|/|C_j|

by (5) and (6), since (i) λ_T → λ and (ii) C_j is 0 with probability 0, such that x ↦ 1/|x| is continuous almost everywhere with respect to the limiting measure. Putting together (7) and (30)–(32) one concludes

V_T(u) ⇒ u'Au − 2u'B + λ |u₁|/|C₁| + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0} := Ψ(u).

Hence, since V_T(u) is convex and Ψ(u) has a unique minimum, it follows from Knight (1999) that arg min V_T(u) ⇒ arg min Ψ(u).

Proof of Theorem 8. The setting is the same as in the proof of Theorem 2. Follow the proof of that theorem, with identical notation, up to (22), with γ₁ = γ₂ = 1. For the case β_j = 0 one has

(33)   λ_T w_{2,j} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / |√T β̂_{I,j}| ⇒ λ |u_{2,j}|/|C_j|

by (17) and (18), since (i) λ_T → λ and (ii) C_j is 0 with probability 0, such that x ↦ 1/|x| is continuous almost everywhere with respect to the limiting measure. Putting together (19)–(22) and (33) one concludes

Ṽ_T(u) ⇒ u'Qu − 2u'B + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0} := Ψ̃(u).

Hence, since Ṽ_T(u) is convex and Ψ̃(u) has a unique minimum, it follows from Knight (1999) that arg min Ṽ_T(u) ⇒ arg min Ψ̃(u).

References

Bühlmann, P. and S. Van De Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer-Verlag, New York.
Candes, E. and T. Tao (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35, 2313-2351.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32, 407-499.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.
Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 849-911.
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press, Princeton.
Huang, J., J. L. Horowitz, and S. Ma (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36, 587-613.
Knight, K. (1999). Epi-convergence in distribution and stochastic equi-semicontinuity. Unpublished manuscript.
Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28, 1356-1378.
Knight, K. and W. Fu (2011). An alternative to unit root tests: Bridge estimators differentiate between nonstationary versus stationary models and select optimal lag. Working paper.
Kock, A. B. (2012). Oracle efficient variable selection in random and fixed effects panel data models. Econometric Theory, forthcoming.
Leeb, H. and B. M. Pötscher (2005). Model selection and inference: Facts and fiction. Econometric Theory 21, 21-59.
Leeb, H. and B. M. Pötscher (2008). Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics 142, 201-211.
Lehmann, E. L. and G. Casella (1998). Theory of Point Estimation. Springer-Verlag, New York.
Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34, 1436-1462.
Ng, S. and P. Perron (2001). Lag length selection and the construction of unit root tests with good size and power. Econometrica 69, 1519-1554.
Phillips, P. C. B. (1987a). Time series regression with a unit root. Econometrica 55, 277-301.
Phillips, P. C. B. (1987b). Towards a unified asymptotic theory for autoregression. Biometrika 74, 535-547.
Pötscher, B. M. and H. Leeb (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. Journal of Multivariate Analysis 100, 2065-2082.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58, 267-288.
Wang, H., G. Li, and C.-L. Tsai (2007). Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 63-78.

Zhao, P. and B. Yu (2006). On model selection consistency of lasso. The Journal of Machine Learning Research 7, 2541-2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.


More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Using all observations when forecasting under structural breaks

Using all observations when forecasting under structural breaks Using all observations when forecasting under structural breaks Stanislav Anatolyev New Economic School Victor Kitov Moscow State University December 2007 Abstract We extend the idea of the trade-off window

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Moreover, the second term is derived from: 1 T ) 2 1

Moreover, the second term is derived from: 1 T ) 2 1 170 Moreover, the second term is derived from: 1 T T ɛt 2 σ 2 ɛ. Therefore, 1 σ 2 ɛt T y t 1 ɛ t = 1 2 ( yt σ T ) 2 1 2σ 2 ɛ 1 T T ɛt 2 1 2 (χ2 (1) 1). (b) Next, consider y 2 t 1. T E y 2 t 1 T T = E(y

More information

Variable Selection in Predictive Regressions

Variable Selection in Predictive Regressions Variable Selection in Predictive Regressions Alessandro Stringhi Advanced Financial Econometrics III Winter/Spring 2018 Overview This chapter considers linear models for explaining a scalar variable when

More information

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED by W. Robert Reed Department of Economics and Finance University of Canterbury, New Zealand Email: bob.reed@canterbury.ac.nz

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

A Test of Cointegration Rank Based Title Component Analysis.

A Test of Cointegration Rank Based Title Component Analysis. A Test of Cointegration Rank Based Title Component Analysis Author(s) Chigira, Hiroaki Citation Issue 2006-01 Date Type Technical Report Text Version publisher URL http://hdl.handle.net/10086/13683 Right

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

ECONOMETRICS II, FALL Testing for Unit Roots.

ECONOMETRICS II, FALL Testing for Unit Roots. ECONOMETRICS II, FALL 216 Testing for Unit Roots. In the statistical literature it has long been known that unit root processes behave differently from stable processes. For example in the scalar AR(1)

More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

Inflation Revisited: New Evidence from Modified Unit Root Tests

Inflation Revisited: New Evidence from Modified Unit Root Tests 1 Inflation Revisited: New Evidence from Modified Unit Root Tests Walter Enders and Yu Liu * University of Alabama in Tuscaloosa and University of Texas at El Paso Abstract: We propose a simple modification

More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Inference for High Dimensional Robust Regression

Inference for High Dimensional Robust Regression Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:

More information

Keywords: sparse models, shrinkage, LASSO, adalasso, time series, forecasting, GARCH.

Keywords: sparse models, shrinkage, LASSO, adalasso, time series, forecasting, GARCH. l 1 -REGULARIZAION OF HIGH-DIMENSIONAL IME-SERIES MODELS WIH FLEXIBLE INNOVAIONS Marcelo C. Medeiros Department of Economics Pontifical Catholic University of Rio de Janeiro Rua Marquês de São Vicente

More information

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 11 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 30 Recommended Reading For the today Advanced Time Series Topics Selected topics

More information

Efficiency Tradeoffs in Estimating the Linear Trend Plus Noise Model. Abstract

Efficiency Tradeoffs in Estimating the Linear Trend Plus Noise Model. Abstract Efficiency radeoffs in Estimating the Linear rend Plus Noise Model Barry Falk Department of Economics, Iowa State University Anindya Roy University of Maryland Baltimore County Abstract his paper presents

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

arxiv: v2 [math.st] 29 Jan 2018

arxiv: v2 [math.st] 29 Jan 2018 On the Distribution, Model Selection Properties and Uniqueness of the Lasso Estimator in Low and High Dimensions Ulrike Schneider and Karl Ewald Vienna University of Technology arxiv:1708.09608v2 [math.st]

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression Simulating Properties of the Likelihood Ratio est for a Unit Root in an Explosive Second Order Autoregression Bent Nielsen Nuffield College, University of Oxford J James Reade St Cross College, University

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

The Role of "Leads" in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji

The Role of Leads in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji he Role of "Leads" in the Dynamic itle of Cointegrating Regression Models Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji Citation Issue 2006-12 Date ype echnical Report ext Version publisher URL http://hdl.handle.net/10086/13599

More information

Long memory and changing persistence

Long memory and changing persistence Long memory and changing persistence Robinson Kruse and Philipp Sibbertsen August 010 Abstract We study the empirical behaviour of semi-parametric log-periodogram estimation for long memory models when

More information

Nonstationary Time Series:

Nonstationary Time Series: Nonstationary Time Series: Unit Roots Egon Zakrajšek Division of Monetary Affairs Federal Reserve Board Summer School in Financial Mathematics Faculty of Mathematics & Physics University of Ljubljana September

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND

DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND Testing For Unit Roots With Cointegrated Data NOTE: This paper is a revision of

More information

Unit Root and Cointegration

Unit Root and Cointegration Unit Root and Cointegration Carlos Hurtado Department of Economics University of Illinois at Urbana-Champaign hrtdmrt@illinois.edu Oct 7th, 016 C. Hurtado (UIUC - Economics) Applied Econometrics On the

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

LASSO-type penalties for covariate selection and forecasting in time series

LASSO-type penalties for covariate selection and forecasting in time series LASSO-type penalties for covariate selection and forecasting in time series Evandro Konzen 1 Flavio A. Ziegelmann 2 Abstract This paper studies some forms of LASSO-type penalties in time series to reduce

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

The Number of Bootstrap Replicates in Bootstrap Dickey-Fuller Unit Root Tests

The Number of Bootstrap Replicates in Bootstrap Dickey-Fuller Unit Root Tests Working Paper 2013:8 Department of Statistics The Number of Bootstrap Replicates in Bootstrap Dickey-Fuller Unit Root Tests Jianxin Wei Working Paper 2013:8 June 2013 Department of Statistics Uppsala

More information

Testing for Unit Roots with Cointegrated Data

Testing for Unit Roots with Cointegrated Data Discussion Paper No. 2015-57 August 19, 2015 http://www.economics-ejournal.org/economics/discussionpapers/2015-57 Testing for Unit Roots with Cointegrated Data W. Robert Reed Abstract This paper demonstrates

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Delta Theorem in the Age of High Dimensions

Delta Theorem in the Age of High Dimensions Delta Theorem in the Age of High Dimensions Mehmet Caner Department of Economics Ohio State University December 15, 2016 Abstract We provide a new version of delta theorem, that takes into account of high

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Consider the trend-cycle decomposition of a time series y t

Consider the trend-cycle decomposition of a time series y t 1 Unit Root Tests Consider the trend-cycle decomposition of a time series y t y t = TD t + TS t + C t = TD t + Z t The basic issue in unit root testing is to determine if TS t = 0. Two classes of tests,

More information

The Spurious Regression of Fractionally Integrated Processes

The Spurious Regression of Fractionally Integrated Processes he Spurious Regression of Fractionally Integrated Processes Wen-Jen say Institute of Economics Academia Sinica Ching-Fan Chung Institute of Economics Academia Sinica January, 999 Abstract his paper extends

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Lecture 14: Shrinkage

Lecture 14: Shrinkage Lecture 14: Shrinkage Reading: Section 6.2 STATS 202: Data mining and analysis October 27, 2017 1 / 19 Shrinkage methods The idea is to perform a linear regression, while regularizing or shrinking the

More information

Cointegration and the joint con rmation hypothesis

Cointegration and the joint con rmation hypothesis Cointegration and the joint con rmation hypothesis VASCO J. GABRIEL Department of Economics, Birkbeck College, UK University of Minho, Portugal October 2001 Abstract Recent papers by Charemza and Syczewska

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY PREFACE xiii 1 Difference Equations 1.1. First-Order Difference Equations 1 1.2. pth-order Difference Equations 7

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

A Gaussian IV estimator of cointegrating relations Gunnar Bårdsen and Niels Haldrup. 13 February 2006.

A Gaussian IV estimator of cointegrating relations Gunnar Bårdsen and Niels Haldrup. 13 February 2006. A Gaussian IV estimator of cointegrating relations Gunnar Bårdsen and Niels Haldrup 3 February 26. Abstract. In static single equation cointegration regression models the OLS estimator will have a non-standard

More information

The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso)

The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso) Electronic Journal of Statistics Vol. 0 (2010) ISSN: 1935-7524 The adaptive the thresholded Lasso for potentially misspecified models ( a lower bound for the Lasso) Sara van de Geer Peter Bühlmann Seminar

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich Submitted to the Annals of Statistics DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich We congratulate Richard Lockhart, Jonathan Taylor, Ryan

More information

Very preliminary, please do not cite without permission. ESTIMATION AND FORECASTING LARGE REALIZED COVARIANCE MATRICES LAURENT A. F.

Very preliminary, please do not cite without permission. ESTIMATION AND FORECASTING LARGE REALIZED COVARIANCE MATRICES LAURENT A. F. Very preliminary, please do not cite without permission. ESTIMATION AND FORECASTING LARGE REALIZED COVARIANCE MATRICES LAURENT A. F. CALLOT VU University Amsterdam, The Netherlands, CREATES, and the Tinbergen

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

ECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes

ECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes ECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes ED HERBST September 11, 2017 Background Hamilton, chapters 15-16 Trends vs Cycles A commond decomposition of macroeconomic time series

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable. Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have

More information

On the Correlations of Trend-Cycle Errors

On the Correlations of Trend-Cycle Errors On the Correlations of Trend-Cycle Errors Tatsuma Wada Wayne State University This version: December 19, 11 Abstract This note provides explanations for an unexpected result, namely, the estimated parameter

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Nonsense Regressions due to Neglected Time-varying Means

Nonsense Regressions due to Neglected Time-varying Means Nonsense Regressions due to Neglected Time-varying Means Uwe Hassler Free University of Berlin Institute of Statistics and Econometrics Boltzmannstr. 20 D-14195 Berlin Germany email: uwe@wiwiss.fu-berlin.de

More information

arxiv: v2 [math.st] 12 Feb 2008

arxiv: v2 [math.st] 12 Feb 2008 arxiv:080.460v2 [math.st] 2 Feb 2008 Electronic Journal of Statistics Vol. 2 2008 90 02 ISSN: 935-7524 DOI: 0.24/08-EJS77 Sup-norm convergence rate and sign concentration property of Lasso and Dantzig

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs Questions and Answers on Unit Roots, Cointegration, VARs and VECMs L. Magee Winter, 2012 1. Let ɛ t, t = 1,..., T be a series of independent draws from a N[0,1] distribution. Let w t, t = 1,..., T, be

More information

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University Topic 4 Unit Roots Gerald P. Dwyer Clemson University February 2016 Outline 1 Unit Roots Introduction Trend and Difference Stationary Autocorrelations of Series That Have Deterministic or Stochastic Trends

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information