On the Oracle Property of the Adaptive LASSO in Stationary and Nonstationary Autoregressions. Anders Bredahl Kock. CREATES Research Paper


On the Oracle Property of the Adaptive LASSO in Stationary and Nonstationary Autoregressions

Anders Bredahl Kock

CREATES Research Paper 2012-05

Department of Economics and Business, Aarhus University, Bartholins Allé 10, DK-8000 Aarhus C, Denmark. oekonomi@au.dk. Tel:

ON THE ORACLE PROPERTY OF THE ADAPTIVE LASSO IN STATIONARY AND NONSTATIONARY AUTOREGRESSIONS

ANDERS BREDAHL KOCK, AARHUS UNIVERSITY AND CREATES

Abstract. We show that the Adaptive LASSO is oracle efficient in stationary and nonstationary autoregressions. This means that it estimates parameters consistently, selects the correct sparsity pattern, and estimates the coefficients belonging to the relevant variables at the same asymptotic efficiency as if only these had been included in the model from the outset. In particular this implies that it is able to discriminate between stationary and nonstationary autoregressions, and it thereby constitutes an addition to the set of unit root tests. However, it is also shown that the Adaptive LASSO has no power against shrinking alternatives of the form c/T, where c is a constant and T the sample size, if it is tuned to perform consistent model selection. We show that if the Adaptive LASSO is instead tuned to perform conservative model selection, it has power even against shrinking alternatives of this form. Monte Carlo experiments reveal that the Adaptive LASSO performs particularly well in the presence of a unit root while being on par with its competitors in the stationary setting.

Keywords: Adaptive LASSO, oracle efficiency, consistent model selection, conservative model selection, autoregression, shrinkage.
AMS 2000 classification: 62F07, 62F10, 62F12, 62J07. JEL classification: C13, C22.

1. Introduction

Variable selection in high-dimensional systems has received a lot of attention in the statistics literature in the recent 10-15 years or so, and it is also becoming increasingly popular in econometrics. As traditional model selection methods are computationally infeasible if the number of covariates is large, focus has been on penalized or shrinkage-type estimators, of which the most famous is probably the LASSO of Tibshirani (1996). This paper sparked a flurry of research into the theoretical properties of LASSO-type estimators, among the first of which was Knight and Fu (2000). Subsequently, many other shrinkage estimators have been analyzed: the SCAD of Fan and Li (2001), the Bridge and Marginal Bridge estimators in Huang et al. (2008), the Dantzig selector of Candes and Tao (2007) and the Sure Independence Screening of Fan and Lv (2008), to mention just a few. For a recent review with particular emphasis on the LASSO see Bühlmann and Van De Geer (2011). The focus of these papers is to establish the so-called oracle property for the proposed estimators. This entails showing that the estimators are consistent, perform correct variable selection, and that the limiting distribution of the estimators of the non-zero coefficients is the same as if only the

Date: February 2012. Part of this research was carried out while I was visiting the Australian National University. I wish to thank Tue Gørgens and the Research School of Economics for inviting me and creating a pleasant environment. I am also indebted to Svend Erik Graversen and Jørgen Hoffmann-Jørgensen for help. I wish to thank CREATES, funded by the Danish National Research Foundation, for providing financial support.

relevant variables had been included in the model. Put differently, the inference is as efficient as if an oracle had revealed the true model to us and estimation had been carried out using only the relevant variables.

Most focus in the statistics literature has been on establishing the oracle property for cross-sectional data. An exception is Wang et al. (2007), who consider the LASSO for stationary autoregressions, while Kock (2012) has shown that the oracle efficiency of the Bridge and Marginal Bridge estimators carries over to linear random and fixed effects panel data settings. In this paper we show that the Adaptive LASSO of Zou (2006) possesses the oracle property in stationary as well as nonstationary autoregressions. We focus on the Adaptive LASSO since the original LASSO is only oracle efficient under rather restrictive assumptions which exclude too high dependence between the covariates, an assumption which is unlikely to be satisfied in time series models.

We shall consider a model of the form

(1)    Δy_t = ρ y_{t−1} + Σ_{j=1}^p β_j Δy_{t−j} + ε_t,

which is sometimes called a Dickey-Fuller regression. ε_t is the error term, to be discussed further in the next section. (1) is said to have a unit root if ρ = 0. When testing for a unit root, one usually first determines the number of lagged differences p to be included. This can be done either by information criteria, or modifications hereof, as in Ng and Perron (2001). Having selected the lags, one tests whether ρ = 0. The oracle efficient estimators create new possibilities for carrying out such tests, since testing for a unit root is basically a variable selection problem: is y_{t−1} to be left out of the model (ρ = 0) or not? Hence, establishing the oracle property for the Adaptive LASSO means that we can choose the number of lagged differences to be included (leaving out irrelevant intermediate lags) and test for a unit root at the same time. Knight and Fu (2011) made this point and have used it to construct a unit root test based on the Bridge estimator in the setting we shall call conservative model selection. We show:

(i) The Adaptive LASSO possesses the oracle property in stationary and nonstationary autoregressions.
(ii) We carry out a detailed finite sample and local analysis in the stationary, nonstationary and local-to-unity settings. The local-to-unity setting reveals that the Adaptive LASSO is not exempt from the critique by Leeb and Pötscher (2005, 2008) of consistent model selection techniques.
(iii) This problem, due to nonuniformity in the asymptotics, can be alleviated if one is willing to tune the Adaptive LASSO to perform conservative model selection instead of consistent model selection. The properties of conservative model selection are investigated in the stationary as well as the nonstationary setting.

The plan of the paper is as follows. Section 2 introduces the Adaptive LASSO and some notation. Section 3 states the oracle theorems for the Adaptive LASSO for stationary and nonstationary autoregressions, while Section 4 carries out a detailed finite sample and local analysis under various settings. Section 5 considers the properties of the Adaptive LASSO when tuned to perform conservative model selection, Section 6 contains some Monte Carlo experiments and Section 7 concludes. All proofs are deferred to the Appendix. We make mathematically precise definitions of consistent and conservative model selection in Section 2.

2. Setup and Notation

The most famous shrinkage estimator is without doubt the LASSO (the Least Absolute Shrinkage and Selection Operator). Its popularity is due to the fact that it carries out variable selection and parameter estimation in one step. However, it has been shown that the LASSO is only oracle efficient under rather strict conditions, see Meinshausen and Bühlmann (2006), Zhao and Yu (2006) and Zou (2006), which do not allow too high correlations between covariates. This motivates using other procedures for variable selection than the LASSO. In particular, the problem of the LASSO is that it penalizes all parameters equally. Hence, Zou (2006) proposed the Adaptive LASSO, which applies more intelligent, data-driven penalization, and proves that it is oracle efficient in a fixed regressor setting. In our context the Adaptive LASSO is defined as the minimizer of the following objective function:

(2)    Ψ_T(ρ, β) = Σ_{t=1}^T ( Δy_t − ρ y_{t−1} − Σ_{j=1}^p β_j Δy_{t−j} )² + λ_T w_1^{γ₁} |ρ| + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} |β_j|,    γ₁, γ₂ > 0,

where w_1 = 1/|ρ̂_I| and w_{2,j} = 1/|β̂_{I,j}| for j = 1, ..., p, and ρ̂_I and β̂_{I,j} denote some initial estimators of the parameters in (1). We shall use the least squares estimator in this paper, but other estimators can be used as well. Hence, the Adaptive LASSO minimizes the least squares objective function plus a penalty term which penalizes parameters that are different from 0. Due to this extra penalty term the minimizers of (2) are shrunk towards zero compared to the least squares estimator, hence the name shrinkage estimator. The size of the shrinkage depends on the penalty term, which in turn depends on the initial least squares estimates: the smaller the initial estimate, the larger the penalty and the more likely it is that the Adaptive LASSO shrinks the parameter exactly to zero. The size of the penalty also depends on the sequence λ_T, which must be chosen in an appropriate manner in order to obtain oracle efficiency. In particular, λ_T must grow fast enough to shrink the estimates of truly zero parameters to zero, but slowly enough in order not to introduce asymptotic bias in the estimators of the non-zero coefficients. The details are given in Section 3.

In this paper we do not include deterministic components such as constants and trends, in order to focus on the main idea of consistent and conservative model selection in stationary and nonstationary autoregressions. However, deterministics could be handled using standard detrending ideas, see e.g. Hamilton (1994).

2.1. Notation. We shall consider T + p + 1 observations from a time series y_t generated by (1). A = { 1 ≤ j ≤ p : β_j ≠ 0 } denotes the active set of lagged differences, i.e. those lagged differences with non-zero coefficients. Let z_t = (Δy_{t−1}, ..., Δy_{t−p})' be the p × 1 vector of lagged differences and let x_t = (y_{t−1}, z_t')' denote the vector of all covariates. Let Σ = E(z_t z_t') and let Σ_A denote the matrix that has picked out all elements in columns and rows indexed by A. So if p = 5 and A = {2, 3, 4}, Σ_A equals the 3 × 3 matrix that has picked out rows and columns 2, 3 and 4 of the 5 × 5 matrix Σ. Similarly, A indexes vectors by picking out the elements with index in A. (Of course the actual value of this expectation depends on whether y_t is stationary or not. However, irrespective of this, z_t is stationary.)
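Purely as an illustration of how the minimizer of (2) can be computed, the following is a minimal sketch of the Adaptive LASSO for the Dickey-Fuller regression, using least squares initial estimates and a plain coordinate descent loop. The function name, the use of coordinate descent rather than the algorithm of Zou (2006) employed in Section 6, and the default settings are choices made here for concreteness, not specifications taken from the paper.

import numpy as np

def adaptive_lasso_adf(y, p, lam, gamma1=1.0, gamma2=1.0, n_iter=500):
    """Adaptive LASSO for the Dickey-Fuller regression (2) via coordinate descent.

    y   : observed series of length T + p + 1
    p   : number of lagged differences
    lam : penalty parameter lambda_T
    Returns (rho_hat, beta_hat).
    """
    dy = np.diff(y)
    # Regress Delta y_t on y_{t-1} and Delta y_{t-1}, ..., Delta y_{t-p}
    Y = dy[p:]
    X = np.column_stack([y[p:-1]] + [dy[p - j:-j] for j in range(1, p + 1)])

    # Initial least squares estimates used to build the adaptive weights
    b_init = np.linalg.lstsq(X, Y, rcond=None)[0]
    gammas = np.array([gamma1] + [gamma2] * p)
    w = 1.0 / np.abs(b_init) ** gammas        # w_1 = 1/|rho_I|^g1, w_2j = 1/|beta_Ij|^g2

    # Coordinate descent on ||Y - X b||^2 + lam * sum_k w_k |b_k|
    b = b_init.copy()
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(p + 1):
            r_k = Y - X @ b + X[:, k] * b[k]  # partial residual excluding coordinate k
            z = X[:, k] @ r_k
            thr = lam * w[k] / 2.0            # soft-threshold level for coordinate k
            b[k] = np.sign(z) * max(abs(z) - thr, 0.0) / col_ss[k]
    return b[0], b[1:]

In practice λ_T (and γ₁, γ₂) would be chosen by an information criterion such as BIC, as is done in the Monte Carlo experiments of Section 6.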

Let Δy = (Δy_1, ..., Δy_T)', y_{−1} = (y_0, ..., y_{T−1})' and Δy_{−j} = (Δy_{1−j}, ..., Δy_{T−j})' for j = 1, ..., p. (The dependence on T is suppressed for some of the quantities where no confusion arises.) Let X = (y_{−1}, Δy_{−1}, ..., Δy_{−p}) be the T × (p+1) matrix of covariates and ε = (ε_1, ..., ε_T)' the vector of error terms. Let Z be a p × 1 vector such that Z ∼ N_p(0, σ²Σ), where σ² = E(ε_t²). Furthermore, (W(r))_{r=0}^{1} denotes the standard Wiener process on [0, 1]. S_T = diag(T, √T, ..., √T) denotes a (p+1) × (p+1) scaling matrix, ⇒ denotes weak convergence (convergence in law) and →_p convergence in probability. Let (ρ̂, β̂) denote the minimizer of (2). Of course (ρ̂, β̂) depends on T, but to keep notation simple we suppress this in the sequel. Where no confusion arises this is also done for other quantities. For any x ∈ R^n, ||x||_{ℓ₂} = (Σ_{i=1}^n x_i²)^{1/2} denotes the standard Euclidean ℓ₂ norm stemming from the inner product <x, x> = Σ_{i=1}^n x_i². Letting M_0 denote the true model and M̂ the estimated one, we shall say that a procedure is consistent if for all (ρ, β), P(M̂ = M_0) → 1. A procedure is said to be conservative if for all (ρ, β), P(M_0 ⊆ M̂) → 1, i.e. the probability of excluding relevant variables tends to zero.

3. Oracle results

This section establishes and discusses the oracle property of the Adaptive LASSO for stationary as well as nonstationary autoregressions. The results open the possibility of using the Adaptive LASSO to distinguish between these two, and hence the Adaptive LASSO can also be seen as a new way of testing for unit roots. We begin with the nonstationary case.

Theorem 1 (Consistent model selection under nonstationarity). Assume that ρ = 0 and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Then, if λ_T/√T → 0, λ_T T^{γ₁−1} → ∞ and λ_T T^{γ₂/2−1/2} → ∞,

1. Consistency: ||S_T[(ρ̂, β̂')' − (0, β')']||_{ℓ₂} ∈ O_p(1).
2. Oracle (i): P(ρ̂ = 0) → 1 and P(β̂_{A^c} = 0) → 1.
3. Oracle (ii): √T(β̂_A − β_A) ⇒ N(0, σ²[Σ_A]^{−1}).

λ_T T^{γ₁−1} → ∞ enables us to set ρ̂ = 0 with probability tending to one if ρ = 0. Likewise, λ_T T^{γ₂/2−1/2} → ∞ is needed to shrink the estimates of truly zero β_j's to zero. Notice that both conditions require λ_T to grow sufficiently fast, i.e. the size of the penalty term must be sufficiently large to shrink the estimates of the zero parameters to zero. λ_T/√T → 0, on the other hand, tells us that λ_T cannot grow too fast, for if λ_T grows too fast even non-zero parameters will be shrunk to zero. In order for all three conditions to be satisfied simultaneously we need γ₁ > 1/2 and γ₂ > 0. It is of interest that the requirements on γ₁ and γ₂ are not the same. The reason for this difference is that ρ̂_I converges at a rate of 1/T while β̂_{I,j} converges at a rate of 1/√T.
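As a simple check of how the three rate conditions interact, the following display works through one admissible tuning choice; the particular values are illustrative and not taken from the paper.

% One admissible tuning choice, purely as an illustration:
% take gamma_1 = gamma_2 = 1 and lambda_T = T^{1/4}. Then
\[
\frac{\lambda_T}{\sqrt{T}} = T^{-1/4} \longrightarrow 0,
\qquad
\lambda_T\,T^{\gamma_1 - 1} = T^{1/4} \longrightarrow \infty,
\qquad
\lambda_T\,T^{\gamma_2/2 - 1/2} = T^{1/4} \longrightarrow \infty,
\]
% so all three conditions hold simultaneously, which is possible precisely because
% gamma_1 > 1/2 and gamma_2 > 0; the same choice also satisfies the (weaker)
% conditions of Theorem 2 below.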

Theorem 1 states that ρ̂ and β̂ are estimated consistently at rates 1/T and 1/√T, respectively. Furthermore, the estimators of the zero coefficients do not only converge to zero in probability, they are set exactly equal to zero with probability tending to one. Hence, the Adaptive LASSO performs variable selection and consistent estimation: the correct sparsity pattern is detected asymptotically. Finally, the asymptotic distribution of the nonzero β_j's is the same as if the true model had been known and only the relevant variables (those with nonzero coefficients) had been included and least squares applied to that model. In other words, the Adaptive LASSO possesses the oracle property: it sets all parameters that are zero exactly equal to zero, and the asymptotic distribution of the estimators of the non-zero coefficients is the same as if only the relevant variables had been included in the model. So the Adaptive LASSO performs as well as if an oracle had revealed the correct sparsity pattern prior to estimation. This sounds too good to be true, and in some sense it is, as we shall see in Section 4.

The assumption that ε_t is i.i.d. can be relaxed as long as S_T^{−1} X'X S_T^{−1} and S_T^{−1} X'ε converge weakly. Of course the limits might change, but since we establish that the Adaptive LASSO is asymptotically equivalent to the least squares estimator including only the relevant variables, we would still conclude that it performs as well as if the true sparsity pattern had been known.

Next, we consider the Adaptive LASSO for stationary autoregressions.

Theorem 2 (Consistent model selection under stationarity). Assume that ρ ∈ (−2, 0) and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Then, if λ_T/√T → 0 and λ_T T^{γ₂/2−1/2} → ∞,

1. Consistency: √T ||[(ρ̂, β̂')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1).
2. Oracle (i): P(ρ̂ = 0) → 0 and P(β̂_{A^c} = 0) → 1.
3. Oracle (ii): √T [(ρ̂ − ρ), (β̂_A − β_A)']' ⇒ N(0, σ²[Q_{(1,A+1)}]^{−1}), where Q = E(x_t x_t') is of dimension (p+1) × (p+1). (In accordance with previous notation, Q_{(1,A+1)} is the matrix consisting of all rows and columns of Q with indexes in the set {1} ∪ (A+1), where the addition is to be understood elementwise.)

Since the assumptions on λ_T are a subset of those made in Theorem 1, the resulting requirements on γ₁ and γ₂ are a fortiori satisfied. The i.i.d. assumption on ε_t can be relaxed as in the nonstationary setting as long as (1/T) X'X converges in probability and (1/√T) X'ε converges weakly. As in Theorem 1 the Adaptive LASSO gives consistent parameter estimates, and as usual for stationary autoregressions the rate of convergence of ρ̂ is slowed down to the standard 1/√T rate compared to 1/T in the nonstationary case. The probability of falsely classifying ρ̂ = 0 tends to 0. As in Theorem 1, all irrelevant lagged differences will be classified as such with probability tending to one.

Theorems 1 and 2 show that the Adaptive LASSO can perform oracle efficient variable selection and estimation in stationary and nonstationary autoregressions. The theorems also suggest that the Adaptive LASSO can be used to discriminate between stationary and nonstationary autoregressions, and so open the possibility of using the Adaptive LASSO for unit root testing. The practical performance will be investigated in Section 6.

4. Finite sample and local analysis

In this section we analyze the finite sample behavior of the Adaptive LASSO. This is most conveniently done in the setting of an AR(1), in order to keep the focus on the main points and the expressions simple. The expressions we obtain for the finite sample selection probabilities also allow us to describe the local behavior of the Adaptive LASSO easily. We give a complete description for all possible limiting values of the regularization parameter λ_T. Since γ₁ = γ₂ = 1 is in accordance with Theorems 1 and 2, i.e. we obtain the oracle property in the stationary as well as the nonstationary setting for this choice, we shall focus on this value in the sequel.

Similar calculations can be made for other admissible values of γ₁ and γ₂. Since there are no lagged differences, γ₂ is redundant in this setting. To be precise, we will consider the model

(3)    Δy_t = ρ y_{t−1} + ε_t,

where ε_t can be quite general. In particular, it just needs to allow for a central limit theorem to apply in the stationary case ρ ∈ (−2, 0) and a functional central limit theorem to apply in the unit root as well as the local-to-unity setting. Appropriate assumptions can be found in Phillips (1987a) and Phillips (1987b), who allow for quite general dependence structures in the sequence {ε_t}. In the following we shall assume that {ε_t} is i.i.d., but keep in mind that the results carry over to much more general assumptions on {ε_t}. ρ is estimated by minimizing

(4)    L(ρ) = Σ_{t=1}^T (Δy_t − ρ y_{t−1})² + 2 λ_T |ρ|/|ρ̂_I|,

which is the AR(1) equivalent of (2), except for a factor of 2 in front of λ_T whose only purpose is to make expressions simpler. Without any confusion, we let ρ̂ denote the minimizer of (4) and ρ̂_I the least squares estimator of ρ. The first theorem gives the exact finite sample probability of setting ρ̂ equal to zero.

Theorem 3. Let Δy_t = ρ y_{t−1} + ε_t and let ρ̂ denote the minimizer of (4). Then

P(ρ̂ = 0) = P( ρ² Σ_{t=1}^T y_{t−1}² + 2ρ Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ).

Theorem 3 characterizes the exact finite sample probability of setting ρ̂ to zero. It is sensible that this is an increasing function of the regularization/shrinkage parameter λ_T, since the larger λ_T is, the larger is the shrinkage. The following theorems quantify the asymptotic behavior of P(ρ̂ = 0) in case of (1) a unit root, (2) stationarity, and (3) local-to-unity behavior in (3). The first theorem is concerned with the nonstationary case where ρ = 0. Hence, we would like to classify ρ̂ = 0. This of course requires λ_T to be sufficiently large.

Theorem 4. Let Δy_t = ρ y_{t−1} + ε_t with ρ = 0 and let ρ̂ denote the minimizer of (4).

1. If λ_T → 0 then P(ρ̂ = 0) → 0.
2. If λ_T → λ ∈ (0, ∞), then P(ρ̂ = 0) → p ∈ (0, 1).
3. If λ_T → ∞ then P(ρ̂ = 0) → 1.

Theorem 4 reveals that in the presence of a unit root, λ_T → ∞ yields consistent model selection, i.e. P(ρ̂ = 0) → 1. If λ_T tends to a finite constant, ρ̂ has mass at 0 in the limit, but the mass is not one. If λ_T → 0, ρ̂ is asymptotically equivalent to the least squares estimator, and in fact even T ρ̂ is asymptotically equivalent to T times the least squares estimator. Since this does not have any mass at 0 in the limit, it is sensible that P(ρ̂ = 0) → 0 when λ_T → 0. In particular, ρ̂ equals the least squares estimator if λ_T = 0 for every T < ∞, which can be seen from (4). Since ρ = 0, Theorem 4 does not impose any restrictions on λ_T in order to obtain conservative model selection, since there are no relevant variables to be excluded.
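The exact selection probability in Theorem 3 is easy to evaluate by simulation. The sketch below does this for the unit root case of Theorem 4; the function name, the choice of standard normal errors and the grid of λ_T values are illustrative assumptions made here rather than specifications from the paper.

import numpy as np

rng = np.random.default_rng(0)

def prob_rho_zero(T, rho, lam, n_rep=2000):
    """Monte Carlo estimate of P(rho_hat = 0) for the AR(1) adaptive LASSO (4).

    Uses the exact condition from Theorem 3: rho_hat = 0 iff
    (sum_t y_{t-1} * dy_t)^2 / sum_t y_{t-1}^2 <= lam, i.e. rho_I^2 * sum_t y_{t-1}^2 <= lam.
    """
    hits = 0
    for _ in range(n_rep):
        eps = rng.standard_normal(T)
        y = np.zeros(T + 1)
        for t in range(1, T + 1):              # dy_t = rho * y_{t-1} + eps_t
            y[t] = y[t - 1] + rho * y[t - 1] + eps[t - 1]
        ylag, dy = y[:-1], np.diff(y)
        stat = (ylag @ dy) ** 2 / (ylag @ ylag)
        hits += stat <= lam
    return hits / n_rep

# Unit root (rho = 0): P(rho_hat = 0) approaches 1 only if lambda_T grows (Theorem 4).
T = 500
for lam in (0.5, 5.0, T ** 0.25, T ** 0.45):
    print(f"lam = {lam:8.2f}:  P(rho_hat = 0) ~ {prob_rho_zero(T, 0.0, lam):.2f}")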

The next theorem concerns the stationary case. In this case we do not want ρ̂ to possess any mass at zero asymptotically. This naturally restricts the rate at which λ_T can increase, as seen below.

Theorem 5. Let Δy_t = ρ y_{t−1} + ε_t with ρ ∈ (−2, 0) and let ρ̂ denote the minimizer of (4).

1. If λ_T/T → 0 then P(ρ̂ = 0) → 0.
2. If λ_T/T → λ ∈ (0, ∞) then P(ρ̂ = 0) → 0 if ρ²E(y_t²) > λ and P(ρ̂ = 0) → 1 if ρ²E(y_t²) < λ.
3. If λ_T/T → ∞ then P(ρ̂ = 0) → 1.

Part 1 of Theorem 5 shows that in order for ρ̂ not to possess any mass at 0 asymptotically, it is sufficient that λ_T ∈ o(T). If λ_T/T → ∞, then ρ̂ will be set to zero with probability tending to one even though ρ ≠ 0, as can be seen from part 3 of the theorem. Part 2 of Theorem 5 is markedly different from part 2 of Theorem 4. This is due to the fact that the random variable which decides whether ρ̂ is to be classified as zero or not converges to a point mass at ρ²E(y_t²) in the stationary setting, while it converges to a nondegenerate distribution in the nonstationary setting. This constant is the knife edge at which P(ρ̂ = 0) switches between 0 and 1. See the appendix for details. Also notice that no classification is possible when λ = ρ²E(y_t²) in the stationary setting, since ρ²E(y_t²) is a discontinuity point of the limiting distribution of the variable that decides whether ρ̂ is to be classified as zero or not. We suspect that P(ρ̂ = 0) then depends not only on λ = ρ²E(y_t²) but on the concrete fashion in which λ_T/T converges to ρ²E(y_t²).

Taken together, Theorems 4 and 5 show that for the Adaptive LASSO to act as a consistent model selection procedure it is sufficient that λ_T → ∞ (by Theorem 4) and λ_T/T → λ for some λ < ρ²E(y_t²) (by Theorem 5). Since ρ ≠ 0 in Theorem 5, λ = 0 works in particular. Hence, λ_T = T^α is admissible for all 0 < α < 1. In order to make the Adaptive LASSO work as a conservative model selection device, Theorem 4 does not pose any restrictions on λ_T, since ρ = 0 in that theorem so there are no relevant variables to be excluded. Hence, the only requirement is λ_T/T → λ for some λ < ρ²E(y_t²) (by Theorem 5). Note that in particular λ_T → λ ∈ [0, ∞) or λ_T = 0 for all T work in this setting. λ_T = 0 amounts to no shrinkage at all and hence a zero probability of excluding relevant variables. These two limiting values of λ_T are ruled out by consistent model selection and will play a crucial role in highlighting the difference between consistent and conservative model selection in Theorem 6 below.

Next, compare the requirements on λ_T from Theorems 4 and 5 for consistent model selection (λ_T → ∞ and λ_T/T → λ < ρ²E(y_t²)) to those resulting from Theorems 1 and 2 with γ₁ = γ₂ = 1. Note that λ_T → ∞ in both groups of theorems. However, Theorems 1 and 2 are more restrictive on the growth rate of λ_T in that they require λ_T/√T → 0, while Theorems 4 and 5 only require λ_T/T → 0. This is not too surprising since Theorems 1 and 2 deliver more: they yield consistent model selection, consistent parameter estimation as well as the oracle efficient distribution. It is not hard to show that the requirements made in Theorems 4 and 5 are also sufficient to yield consistent parameter estimation. However, only requiring λ_T/T → 0 is not enough to ensure that √T ρ̂ is asymptotically equivalent to √T times the least squares estimator when ρ ≠ 0, since we penalize, and hence shrink, too much. The result is that ρ̂ no longer obtains the oracle efficient distribution asymptotically. λ_T/√T → 0 is needed to obtain the oracle efficient distribution. Knight and Fu (2000) made a similar observation in a deterministic cross-sectional setting.

The next theorem concerns the local-to-unity situation. So far all results have been for pointwise asymptotics. The local-to-unity setting is a harder test for an estimator of ρ in the sense that it

must perform well on a sequence of shrinking alternatives instead of only at a single point in the parameter space.

Theorem 6. Let Δy_t = ρ y_{t−1} + ε_t with ρ = c/T for some c ≠ 0 and let ρ̂ denote the minimizer of (4).

1. If λ_T → 0 then P(ρ̂ = 0) → 0.
2. If λ_T → λ ∈ (0, ∞), then P(ρ̂ = 0) → p ∈ (0, 1).
3. If λ_T → ∞ then P(ρ̂ = 0) → 1.

Consistent model selection requires that λ_T → ∞ by Theorem 4. By part 3 of Theorem 6 this implies that P(ρ̂ = 0) → 1. Hence, the Adaptive LASSO tuned to perform consistent model selection has no power against deviations from 0 of the form c/T. This is a negative result since c/T ≠ 0 for all T. This poor local performance is the flip side of shrinkage estimators tuned to perform consistent model selection and is reminiscent of Hodges' estimator, see e.g. Lehmann and Casella (1998). This phenomenon has already been observed by Leeb and Pötscher (2005, 2008) and Pötscher and Leeb (2009) in a different context. Recall, however, that λ_T → ∞ is required in Theorems 1, 2 and 4 in order to achieve enough shrinkage to obtain consistent model selection. The price paid for a shrinkage of this size is that even parameters that are local to zero at a rate of O(1/T) will be shrunk to zero with probability tending to one. The Adaptive LASSO tuned to perform consistent model selection has no power against such alternatives.

On the other hand, the Adaptive LASSO tuned to perform conservative model selection does have power against deviations from 0 of this form. This becomes clear from parts 1 and 2 of Theorem 6, since λ_T → 0 and λ_T → λ ∈ (0, ∞) are both in line with conservative model selection (see the discussion between Theorems 5 and 6). If λ_T → λ ∈ [0, ∞) then the probability of setting ρ̂ exactly equal to zero no longer tends to one for ρ = c/T. However, by Theorem 4, the same is the case when ρ = 0. If this tradeoff is preferred to consistent model selection, the next section gives the properties of the Adaptive LASSO in the AR(p) model when tuned to perform conservative model selection.

5. Conservative model selection

We continue to consider the case γ₁ = γ₂ = 1 since this keeps expressions simple. When tuned to perform conservative model selection the properties of the Adaptive LASSO are as follows.

Theorem 7 (Conservative model selection under nonstationarity). Assume that ρ = 0 and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Let γ₁ = γ₂ = 1 (this assumption is not essential at all; it is only made to ensure that a single penalty sequence λ_T appears, so that we do not have to deal with different cases for λ_T^{γ₁} and λ_T^{γ₂}). Then, if λ_T → λ ∈ [0, ∞),

S_T [(ρ̂, β̂')' − (0, β')'] ⇒ arg min Ψ(u), which implies ||S_T [(ρ̂, β̂')' − (0, β')']||_{ℓ₂} ∈ O_p(1), where

Ψ(u) = u'Au − 2u'B + λ |u₁|/|C₁| + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0},

with

A = diag( σ²(1 − Σ_{j=1}^p β_j)^{−2} ∫₀¹ W(r)² dr , Σ ),      B = ( σ²(1 − Σ_{j=1}^p β_j)^{−1} ∫₀¹ W(r) dW(r) , Z' )',

C₁ = (1 − Σ_{j=1}^p β_j) ∫₀¹ W(s) dW(s) / ∫₀¹ W(s)² ds      and      C_j ∼ N(0, σ²[Σ^{−1}]_{j,j}).

Theorem 7 reveals that (ρ̂, β̂) converges at the same rate as the least squares estimator, but to the minimizer of Ψ(u). Note that no shrinkage is applied to u_{2,j} for j ∈ A, which is a desirable property. For λ = 0, Theorem 7 reveals that the asymptotic distribution of the Adaptive LASSO estimator is identical to that of the minimizer of u'Au − 2u'B. This in turn reveals that in this case the limit law of the Adaptive LASSO estimator is identical to the one of the least squares estimator in the model including all variables. This result is of course not surprising, since λ = 0 implies that asymptotically there is no penalty on nonzero parameters and hence no shrinkage, which implies that the objective function of the Adaptive LASSO, (2), approaches the least squares objective function. The absence of shrinkage also implies that no parameters will be set exactly equal to 0; or, more precisely, the probability of a parameter being set to 0 is 0. If λ ∈ (0, ∞), the penalty terms no longer vanish asymptotically, except for those on the nonzero β_j. Hence, with positive probability the minimizer of Ψ(u) has entries with value zero (actually calculating this probability seems to be non-trivial).

Next, consider conservative model selection in the stationary case.

Theorem 8 (Conservative model selection under stationarity). Assume that ρ ∈ (−2, 0) and that ε_t is i.i.d. with E(ε_t) = 0 and E(ε_t⁴) < ∞. Let γ₁ = γ₂ = 1 (as in Theorem 7, this assumption is not essential). Then, if λ_T → λ ∈ [0, ∞),

√T [(ρ̂, β̂')' − (ρ, β')'] ⇒ arg min Ψ̃(u), which implies √T ||[(ρ̂, β̂')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1), where

Ψ̃(u) = u'Qu − 2u'B + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0}

with B ∼ N_{p+1}(0, σ²Q) and C_j ∼ N(0, σ²[Q^{−1}]_{1+j,1+j}).

As in the nonstationary case, (ρ̂, β̂) converges at the same rate as the least squares estimator, but to the minimizer of Ψ̃(u). Note that no shrinkage is applied to u_{2,j} for j ∈ A. More importantly, no shrinkage is applied to u₁ since now ρ ≠ 0. Similarly to the nonstationary case, (ρ̂, β̂) converges to the same limit as the least squares estimator if λ = 0. A particular instance of this is of course λ_T = 0 for all T, in which case the Adaptive LASSO estimate is equal to the least squares estimate, so their limiting laws are a fortiori identical.

6. Monte Carlo

This section illustrates the above results by means of Monte Carlo experiments. The Adaptive LASSO is implemented by means of the algorithm proposed in Zou (2006). Its performance is compared to the LASSO implemented by the LARS algorithm of Efron et al. (2004) using the publicly available package at cran.r-project.org. Furthermore, a comparison is made to the BIC only selecting over the lagged differences, i.e. y_{t−1} is always included in the model. Using the model chosen by BIC, an Augmented Dickey-Fuller test is carried out for the presence of a unit root at a 5% significance level. The results for this procedure are denoted BICDF. Finally, all these procedures are compared to the OLS Oracle (OLSO), which carries out least squares only including the relevant variables. The Adaptive LASSO is implemented with γ₁ = γ₂ = γ ∈ {0.5, 1, 10}. γ = 0.5 is included since it is at the lowest end of the values of γ which are in accordance with Theorems 1 and 2. We also experimented with values of γ larger than 10, but the performance of the Adaptive LASSO was not improved by these. Finally, the Adaptive LASSO was also implemented by selecting γ by BIC from the above values.

The above procedures are compared along the following dimensions.

1. Sparsity pattern: How often does the procedure detect the correct sparsity pattern, i.e. how often does it include all relevant variables while not including any irrelevant ones?
2. Unit root: How often does the procedure make the correct decision on inclusion/exclusion of y_{t−1}? Or put differently, how well do the procedures classify whether ρ = 0 or not?
3. Relevant included: How often does the procedure include all relevant variables in the model? Even though the correct sparsity pattern is not detected, it is of interest to know whether the procedure at least does not exclude any relevant variables from the model; or, in the jargon of the previous sections, whether the procedure at least performs conservative model selection.
4. Loss: How accurately does the estimated model predict on a hold-out sample? Here we generate data from the same data generating process as used for the specification and estimation and use the estimated parameters to make predictions on this hold-out sample.

To gauge the performance along the above dimensions we carry out the following experiments with sample sizes of T = 100 and 1000. The number of Monte Carlo replications is 1000 in all cases.

Experiment A: ρ = 0 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). A unit root setting with three relevant lagged differences.

Experiment B: ρ = −0.05 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). A stationary, but close to unit root, setting with three relevant lagged differences. This should be a challenging setting since the data generating process is stationary but still close to the unit root setting, which makes it harder to classify ρ ≠ 0.

Experiment C: ρ = 0 with twelve lagged differences: the three non-zero coefficients from Experiment A at lags 1, 2 and 3, two additional non-zero coefficients at lags 7 and 10, and all other coefficients equal to zero.
This experiment is carried out to investigate how well the methods fare when there is a gap in the lag structure.
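For concreteness, the following is a minimal sketch of how data from a design such as Experiment A can be generated; the standard normal error distribution, the burn-in length and the function names are assumptions made here for illustration, not specifications from the paper.

import numpy as np

def simulate_adf(T, rho, beta, rng, burn=200):
    """Simulate Delta y_t = rho*y_{t-1} + sum_j beta_j*Delta y_{t-j} + eps_t with eps_t ~ N(0, 1).

    Returns the last T + p + 1 levels of y, enough to run the regression with p lagged differences.
    """
    p = len(beta)
    n = burn + T + p + 1
    y = np.zeros(n)
    dy = np.zeros(n)
    for t in range(p + 1, n):
        dy[t] = rho * y[t - 1] + sum(beta[j] * dy[t - 1 - j] for j in range(p)) + rng.standard_normal()
        y[t] = y[t - 1] + dy[t]
    return y[-(T + p + 1):]

rng = np.random.default_rng(1)
beta_A = np.array([0.4, 0.3, 0.2] + [0.0] * 7)   # Experiment A design as described above
y_est  = simulate_adf(T=100, rho=0.0, beta=beta_A, rng=rng)   # estimation sample
y_hold = simulate_adf(T=100, rho=0.0, beta=beta_A, rng=rng)   # independent hold-out sample

In each replication an estimation sample and an independent hold-out sample of this kind are drawn, the latter being used to compute the Loss criterion.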

6.1. Results. The best performer in terms of choosing the correct sparsity pattern in Experiment A when T = 100 is the Adaptive LASSO with γ = 10 (see Table 1). It is also superior when it comes to correctly classifying ρ = 0 versus ρ ≠ 0: it always makes the correct classification. This is better than the BICDF, which classifies ρ correctly in 90% of the instances. The parameter estimates of the Adaptive LASSO using γ = 10 also yield the best out-of-sample predictive accuracy (lowest Loss), outperforming all other procedures except for the infeasible OLS Oracle. It is seen that classifying ρ correctly plays an important role in obtaining a low Loss, since procedures with low success in classifying ρ incur the biggest losses (BIC and LASSO) and vice versa. For the Adaptive LASSO no choice of γ retains all the relevant variables more than 50% of the time, and interestingly γ = 10 is the worst choice along this dimension. Among all procedures, the LASSO does by far the best job at retaining relevant variables when T = 100. For T = 1000 the Adaptive LASSO performs as well as the OLS Oracle along all dimensions, underscoring its oracle property. It chooses the correct sparsity pattern almost every time (except when γ = 0.5) and always classifies ρ correctly. Experiment A also underscores that the LASSO is not a consistent variable selection procedure: as the sample size increases, the fraction of correctly selected sparsity patterns remains constant.

Table 1. Experiment A: ρ = 0 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). Rows: Sparsity Pattern, Unit Root, Relevant Retained, Loss; columns: Adaptive LASSO (γ = 0.5, γ = 1, γ = 10, γ selected by BIC), BIC, BICDF, LASSO, OLS Oracle (OLSO); reported for T = 100 and T = 1000.

The first thing one notices in Experiment B (Table 2) is that now γ = 0.5 is the preferred value of γ for the Adaptive LASSO. This is in opposition to the nonstationary setting in Experiment A, where γ = 10 yielded the best performance. It is also interesting that the LASSO is actually the strongest performer when T = 100: it selects the correct sparsity pattern most often, classifies ρ correctly, and retains all relevant variables in about 80% of the cases. However, its lack of variable selection consistency is underscored by the fact that the fraction of times it selects the correct sparsity pattern does not tend to one even as the sample size increases (the estimations were also carried out for a larger sample size, not reported here, and even then the LASSO only detected the correct sparsity pattern in 47% of the instances). In this stationary setting it seems less important to classify ρ correctly in order to achieve a low Loss on the hold-out sample, since all procedures incur roughly the same Loss. This is in opposition to Experiment A. For the Adaptive LASSO as well as the BIC, all quantities approach the ones of the OLS Oracle as the sample size increases, illustrating the oracle efficiency of these estimators. The two procedures perform about equally well in this stationary setting. The poor performance of the Adaptive LASSO for γ = 10 may seem to be in opposition to Theorem 2, but for the larger sample size (results not reported here) the Adaptive LASSO detects the correct sparsity pattern in more than 90% of the instances even with this value of γ.

Table 2. Experiment B: ρ = −0.05 and β = (0.4, 0.3, 0.2, 0, 0, 0, 0, 0, 0, 0). Rows: Sparsity Pattern, Unit Root, Relevant Retained, Loss; columns: Adaptive LASSO (γ = 0.5, γ = 1, γ = 10, γ selected by BIC), BIC, BICDF, LASSO, OLSO; reported for T = 100 and T = 1000.

As in Experiment A, the data generating process possesses a unit root in Experiment C. Considering the results for T = 100 in Table 3, the findings from Experiment A are roughly confirmed. Choosing γ = 10 for the Adaptive LASSO seems to be a wise choice in the unit root setting, even with gaps in the lag structure. The Adaptive LASSO always classifies ρ correctly with this value of γ and outperforms its closest competitor, the Adaptive LASSO using BIC to choose γ. By considering the Loss on the hold-out sample it is also seen that a correct classification of ρ results in a big gain in predictive power in the presence of a unit root, confirming the finding in Experiment A. As the sample size increases, the Adaptive LASSO and the BIC perform better while the performance of the LASSO does not improve, underscoring the oracle property of the first two procedures and the lack of the same for the LASSO. Note that the BICDF cannot be expected to classify ρ correctly in more than 95% of the cases in the presence of a unit root, since testing is carried out at a 5% significance level. On the other hand, the correct classification probability of the Adaptive LASSO tends to one. Furthermore, the BIC is considerably slower to implement since the number of regressions to be run increases exponentially in the number of potential explanatory variables.

Table 3. Experiment C: ρ = 0 with a gap in the lag structure (twelve lagged differences, non-zero coefficients at lags 1, 2, 3, 7 and 10). Rows: Sparsity Pattern, Unit Root, Relevant Retained, Loss; columns: Adaptive LASSO (γ = 0.5, γ = 1, γ = 10, γ selected by BIC), BIC, BICDF, LASSO, OLSO; reported for T = 100 and T = 1000.

7. Conclusion

This paper has shown that the Adaptive LASSO can be tuned to perform consistent model selection in stationary and nonstationary autoregressions. The estimator of the parameters converges at the oracle efficient rate, i.e. as fast as if an oracle had revealed the true model prior to estimation and only the relevant variables had been included in a least squares estimation. This enables us to use the Adaptive LASSO to distinguish between stationary and nonstationary autoregressions. However, the Adaptive LASSO has no power against alternatives in a shrinking neighborhood around 0 when tuned to perform consistent variable selection. This problem can be alleviated by tuning the Adaptive LASSO to perform conservative model selection. The price paid compared to consistent model selection is that truly zero parameters are no longer set to zero with probability tending to one, but still with positive probability.

Monte Carlo experiments confirm that the Adaptive LASSO performs well compared to standard competitors. This is the case in particular for nonstationary data.

8. Appendix

Proof of Theorem 1. For the proof of this theorem we will need the following results, which can be found in e.g. Hamilton (1994):

(5)    S_T^{−1} X'X S_T^{−1} ⇒ diag( σ²(1 − Σ_{j=1}^p β_j)^{−2} ∫₀¹ W(r)² dr , Σ ) := A,

(6)    S_T^{−1} X'ε ⇒ ( σ²(1 − Σ_{j=1}^p β_j)^{−1} ∫₀¹ W(r) dW(r) , Z' )' := B.

We shall also make use of the fact that the least squares estimator (ρ̂_I, β̂_I') of (ρ, β') in (1) satisfies ||S_T[(ρ̂_I, β̂_I')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1).

First, let u = (u₁, u₂')', where u₁ is a scalar and u₂ a p × 1 vector. Set ρ = u₁/T and β_j = β_j + u_{2,j}/√T, which implies that (2), as a function of u, can be written as

Ψ_T(u) = ||Δy − (u₁/T) y_{−1} − Σ_{j=1}^p (β_j + u_{2,j}/√T) Δy_{−j}||²_{ℓ₂} + λ_T w_1^{γ₁} |u₁|/T + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} |β_j + u_{2,j}/√T|.

Let û = (û₁, û₂')' = arg min Ψ_T(u) and notice that û₁ = T ρ̂ and û_{2,j} = √T(β̂_j − β_j) for j = 1, ..., p. Define

V_T(u) = Ψ_T(u) − Ψ_T(0) = u' S_T^{−1} X'X S_T^{−1} u − 2u' S_T^{−1} X'ε + λ_T w_1^{γ₁} |u₁|/T + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ).

Consider the first two terms in the above display. It follows from (5) and (6) that

(7)    u' S_T^{−1} X'X S_T^{−1} u − 2u' S_T^{−1} X'ε ⇒ u'Au − 2u'B.

Furthermore,

(8)    λ_T w_1^{γ₁} |u₁|/T = λ_T |u₁| / (T |ρ̂_I|^{γ₁}) = (λ_T T^{γ₁−1}) |u₁| / |T ρ̂_I|^{γ₁} → { ∞ in probability if u₁ ≠ 0; 0 in probability if u₁ = 0 }

since T ρ̂_I is tight. Also, if β_j ≠ 0,

(9)    λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = (λ_T/√T) (1/|β̂_{I,j}|^{γ₂}) √T( |β_j + u_{2,j}/√T| − |β_j| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|β̂_{I,j}|^{γ₂} → 1/|β_j|^{γ₂} < ∞ in probability, and (iii) √T( |β_j + u_{2,j}/√T| − |β_j| ) → u_{2,j} sign(β_j). Finally, if β_j = 0,

(10)   λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / (√T |β̂_{I,j}|^{γ₂}) = (λ_T T^{(γ₂−1)/2}) |u_{2,j}| / |√T β̂_{I,j}|^{γ₂} → { ∞ in probability if u_{2,j} ≠ 0; 0 in probability if u_{2,j} = 0 }

since (i) λ_T T^{(γ₂−1)/2} → ∞ and (ii) √T β̂_{I,j} is tight. Putting together (7)–(10) one concludes

V_T(u) ⇒ Ψ(u) = { u'Au − 2u'B if u₁ = 0 and u_{2,j} = 0 for all j ∈ A^c; ∞ if u₁ ≠ 0 or u_{2,j} ≠ 0 for some j ∈ A^c }.

Since V_T(u) is convex and Ψ(u) has a unique minimum, it follows from Knight (1999) that arg min V_T(u) ⇒ arg min Ψ(u). Hence,

(11)   û₁ ⇒ δ₀,
(12)   û_{2,A^c} ⇒ δ₀^{|A^c|},
(13)   û_{2,A} ⇒ N(0, σ²[Σ_A]^{−1}),

where δ₀ is the Dirac measure at 0 and |A^c| is the cardinality of A^c (hence, δ₀^{|A^c|} is the |A^c|-dimensional Dirac measure at 0). Notice that (11) and (12) imply that û₁ → 0 in probability and û_{2,A^c} → 0 in probability. An equivalent formulation of (11)–(13) is

(14)   T ρ̂ ⇒ δ₀,
(15)   √T (β̂_{A^c} − β_{A^c}) ⇒ δ₀^{|A^c|},
(16)   √T (β̂_A − β_A) ⇒ N(0, σ²[Σ_A]^{−1}).

(14)–(16) yield the consistency part of the theorem at the rate of T for ρ̂ and √T for β̂. Notice that this also implies that no β̂_j, j ∈ A, will be set equal to 0, since for all j ∈ A, β̂_j converges in probability to β_j ≠ 0. (16) also yields the oracle efficient asymptotic distribution, i.e. part 3 of the theorem.

It remains to show part 2 of the theorem: P(ρ̂ = 0) → 1 and P(β̂_{A^c} = 0) → 1. Both proofs are by contradiction. First, assume ρ̂ ≠ 0. Then the first order conditions for a minimum read

−2 y_{−1}'(Δy − X(ρ̂, β̂')') + λ_T w_1^{γ₁} sign(ρ̂) = 0,

which is equivalent to

−2 (1/T) y_{−1}'(Δy − X(ρ̂, β̂')') + (λ_T/T) w_1^{γ₁} sign(ρ̂) = 0.

Consider first the second term:

| (λ_T/T) w_1^{γ₁} sign(ρ̂) | = (λ_T T^{γ₁−1}) / |T ρ̂_I|^{γ₁} → ∞ in probability

since T ρ̂_I is tight. For the first term one has

(1/T) y_{−1}'(Δy − X(ρ̂, β̂')') = (1/T) y_{−1}'ε − (1/T) y_{−1}'X S_T^{−1} S_T[(ρ̂, β̂')' − (0, β')'].

By (6), (1/T) y_{−1}'ε ⇒ σ²(1 − Σ_{j=1}^p β_j)^{−1} ∫₀¹ W(r) dW(r), and (1/T) y_{−1}'X S_T^{−1} ⇒ ( σ²(1 − Σ_{j=1}^p β_j)^{−2} ∫₀¹ W(r)² dr, 0, ..., 0 ) by (5). Hence, (1/T) y_{−1}'ε and (1/T) y_{−1}'X S_T^{−1} are tight. We also know that S_T[(ρ̂, β̂')' − (0, β')'] converges weakly by (14)–(16), which implies it is tight as well. Taken together, (1/T) y_{−1}'(Δy − X(ρ̂, β̂')') is tight and so

P(ρ̂ ≠ 0) ≤ P( −2 (1/T) y_{−1}'(Δy − X(ρ̂, β̂')') + (λ_T/T) w_1^{γ₁} sign(ρ̂) = 0 ) → 0.

Next, assume β̂_j ≠ 0 for j ∈ A^c. From the first order conditions,

−2 Δy_{−j}'(Δy − X(ρ̂, β̂')') + λ_T w_{2,j}^{γ₂} sign(β̂_j) = 0,

or equivalently,

−2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0.

First, consider the second term:

| (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) | = (λ_T T^{(γ₂−1)/2}) / |√T β̂_{I,j}|^{γ₂} → ∞ in probability

since √T β̂_{I,j} is tight. Regarding the first term,

(1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') = (1/√T) Δy_{−j}'ε − (1/√T) Δy_{−j}'X S_T^{−1} S_T[(ρ̂, β̂')' − (0, β')'].

By (6), (1/√T) Δy_{−j}'ε ⇒ N(0, σ²Σ_{j,j}), where in accordance with previous notation Σ_{j,j} is the jth diagonal element of Σ, and (1/√T) Δy_{−j}'X S_T^{−1} ⇒ (0, Σ_{j,1}, ..., Σ_{j,p}) by (5). Hence, (1/√T) Δy_{−j}'ε and (1/√T) Δy_{−j}'X S_T^{−1} are tight. The same is the case for S_T[(ρ̂, β̂')' − (0, β')'] since it converges weakly by (14)–(16). Taken together, (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') is tight and so

P(β̂_j ≠ 0) ≤ P( −2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0 ) → 0.

Proof of Theorem 2. The proof runs along the same lines as the proof of Theorem 1. For the proof we will need (17) and (18) below, which can be found in e.g. Hamilton (1994, Chapter 8). Notice that by definition of x_t = (y_{t−1}, z_t')', the lower right hand p × p block of Q is Σ. We shall make use of the following limit results:

(17)   (1/T) X'X →_p Q,

(18)   (1/√T) X'ε ⇒ N_{p+1}(0, σ²Q) := B,

where the definition of B means that B is a random vector distributed as N_{p+1}(0, σ²Q). We shall also make use of the fact that the least squares estimator is consistent under stationarity, i.e. √T ||[(ρ̂_I, β̂_I')' − (ρ, β')']||_{ℓ₂} ∈ O_p(1).

First, let u = (u₁, u₂')', where u₁ is a scalar and u₂ a p × 1 vector. Set ρ = ρ + u₁/√T and β_j = β_j + u_{2,j}/√T, and write

Ψ̃_T(u) = ||Δy − (ρ + u₁/√T) y_{−1} − Σ_{j=1}^p (β_j + u_{2,j}/√T) Δy_{−j}||²_{ℓ₂} + λ_T w_1^{γ₁} |ρ + u₁/√T| + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} |β_j + u_{2,j}/√T|.

Let û = (û₁, û₂')' = arg min Ψ̃_T(u) and notice that û₁ = √T(ρ̂ − ρ) and û_{2,j} = √T(β̂_j − β_j) for j = 1, ..., p. Define

(19)   Ṽ_T(u) = Ψ̃_T(u) − Ψ̃_T(0) = u' (1/T) X'X u − 2u' (1/√T) X'ε + λ_T w_1^{γ₁} ( |ρ + u₁/√T| − |ρ| ) + λ_T Σ_{j=1}^p w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ).

Consider the first two terms in the above display. It follows from (17) and (18) that

(20)   u' (1/T) X'X u − 2u' (1/√T) X'ε ⇒ u'Qu − 2u'B.

Furthermore, since ρ ≠ 0,

(21)   λ_T w_1^{γ₁} ( |ρ + u₁/√T| − |ρ| ) = (λ_T/√T) (1/|ρ̂_I|^{γ₁}) √T( |ρ + u₁/√T| − |ρ| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|ρ̂_I|^{γ₁} → 1/|ρ|^{γ₁} < ∞ in probability, and (iii) √T( |ρ + u₁/√T| − |ρ| ) → u₁ sign(ρ).

Similarly, if β_j ≠ 0,

(22)   λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = (λ_T/√T) (1/|β̂_{I,j}|^{γ₂}) √T( |β_j + u_{2,j}/√T| − |β_j| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|β̂_{I,j}|^{γ₂} → 1/|β_j|^{γ₂} < ∞ in probability, and (iii) √T( |β_j + u_{2,j}/√T| − |β_j| ) → u_{2,j} sign(β_j). Finally, if β_j = 0,

λ_T w_{2,j}^{γ₂} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / (√T |β̂_{I,j}|^{γ₂}) = (λ_T T^{(γ₂−1)/2}) |u_{2,j}| / |√T β̂_{I,j}|^{γ₂} → { ∞ in probability if u_{2,j} ≠ 0; 0 in probability if u_{2,j} = 0 }

since (i) λ_T T^{(γ₂−1)/2} → ∞ and (ii) √T β̂_{I,j} is tight. Putting (19)–(22) together one concludes

Ṽ_T(u) ⇒ Ψ(u) = { u'Qu − 2u'B if u_{2,j} = 0 for all j ∈ A^c; ∞ if u_{2,j} ≠ 0 for some j ∈ A^c }.

Since Ṽ_T(u) is convex and Ψ(u) has a unique minimum, it follows from Knight (1999) that arg min Ṽ_T(u) ⇒ arg min Ψ(u). Hence,

(23)   (û₁, û_{2,A}')' ⇒ N(0, σ²[Q_{(1,A+1)}]^{−1}),
(24)   û_{2,A^c} ⇒ δ₀^{|A^c|},

where δ₀ is the Dirac measure at 0 and |A^c| is the cardinality of A^c (hence, δ₀^{|A^c|} is the |A^c|-dimensional Dirac measure at 0). Notice that (24) implies that û_{2,A^c} → 0 in probability. An equivalent formulation of (23) and (24) is

(25)   √T [(ρ̂ − ρ), (β̂_A − β_A)']' ⇒ N(0, σ²[Q_{(1,A+1)}]^{−1}),
(26)   √T (β̂_{A^c} − β_{A^c}) ⇒ δ₀^{|A^c|}.

(25) and (26) establish the consistency part of the theorem at the oracle rate of √T. Note that this also implies that for no j ∈ A will β̂_j be set equal to 0, since for each j ∈ A, β̂_j converges in probability to β_j ≠ 0. The same is true for ρ̂. (25) also yields the oracle efficient asymptotic distribution, i.e. part 3 of the theorem. It remains to show part 2 of the theorem: P(β̂_{A^c} = 0) → 1. The proof is by contradiction.

Assume β̂_j ≠ 0 for j ∈ A^c. From the first order conditions,

−2 Δy_{−j}'(Δy − X(ρ̂, β̂')') + λ_T w_{2,j}^{γ₂} sign(β̂_j) = 0,

or equivalently,

−2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0.

First, consider the second term:

| (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) | = (λ_T T^{(γ₂−1)/2}) / |√T β̂_{I,j}|^{γ₂} → ∞ in probability

since √T β̂_{I,j} is tight. Regarding the first term,

(1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') = (1/√T) Δy_{−j}'ε − (1/T) Δy_{−j}'X √T[(ρ̂ − ρ), (β̂ − β)']'.

By (18), (1/√T) Δy_{−j}'ε ⇒ N(0, σ²Q_{j+1,j+1}), where in accordance with previous notation Q_{j+1,j+1} is the (j+1)th diagonal element of Q, and (1/T) Δy_{−j}'X →_p (Q_{j+1,1}, ..., Q_{j+1,p+1}) by (17). Hence, (1/√T) Δy_{−j}'ε and (1/T) Δy_{−j}'X are tight. The same is the case for √T[(ρ̂ − ρ), (β̂ − β)']' since it converges weakly by (25)–(26). Hence,

P(β̂_j ≠ 0) ≤ P( −2 (1/√T) Δy_{−j}'(Δy − X(ρ̂, β̂')') + (λ_T/√T) w_{2,j}^{γ₂} sign(β̂_j) = 0 ) → 0.

Before proving Theorem 3 we prove the following lemma. Let (x)₊ = max(x, 0).

Lemma 1. Let g : R → R be given by g(u) = u² − 2au + λ|u|, λ ≥ 0, a ≠ 0. Then arg min g = 0 if and only if λ ≥ 2|a|. More precisely, arg min g = sign(a)(|a| − λ/2)₊.

Proof. Assume a > 0. Since g'(u) = 2u − 2a + λ sign(u) is strictly negative for u < 0, arg min g ∈ [0, ∞). ũ > 0 is a local minimum, and hence a global minimum since g is strictly convex, if and only if it is a stationary point, i.e. g'(ũ) = 2ũ − 2a + λ = 0, which is equivalent to 0 < ũ = a − λ/2, i.e. λ < 2a. This contradicts λ ≥ 2a, and so arg min g = 0 when λ ≥ 2|a|. In total, the above shows that arg min g = (a − λ/2)₊ = sign(a)(|a| − λ/2)₊ for a > 0. Similar arguments establish the result for a < 0.
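Lemma 1 is the familiar soft-thresholding rule, and it can be checked numerically; the following small sketch is illustrative only, with the function name and the grid chosen here for convenience.

import numpy as np

def soft_threshold_argmin(a, lam):
    """Closed-form minimizer of g(u) = u^2 - 2*a*u + lam*|u| from Lemma 1:
    arg min g = sign(a) * max(|a| - lam/2, 0)."""
    return np.sign(a) * max(abs(a) - lam / 2.0, 0.0)

# Quick numerical comparison against a brute-force grid search.
a, lam = 0.7, 1.0
grid = np.linspace(-3, 3, 200001)
g = grid ** 2 - 2 * a * grid + lam * np.abs(grid)
print(grid[np.argmin(g)], soft_threshold_argmin(a, lam))   # both approximately 0.2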

Proof of Theorem 3. ρ̂ minimizes

L(ρ) = Σ_{t=1}^T (Δy_t − ρ y_{t−1})² + 2λ_T |ρ|/|ρ̂_I|,

which is equivalent to minimizing

ρ² − 2ρ ( Σ_{t=1}^T y_{t−1} Δy_t / Σ_{t=1}^T y_{t−1}² ) + ( 2λ_T / (|ρ̂_I| Σ_{t=1}^T y_{t−1}²) ) |ρ| = ρ² − 2ρ ρ̂_I + ( 2λ_T / (|ρ̂_I| Σ_{t=1}^T y_{t−1}²) ) |ρ|.

It follows from Lemma 1 that ρ̂ = 0 if and only if

2λ_T / (|ρ̂_I| Σ_{t=1}^T y_{t−1}²) ≥ 2|ρ̂_I|,   i.e.   ρ̂_I² Σ_{t=1}^T y_{t−1}² ≤ λ_T.

Hence, recalling that ρ̂_I = Σ_{t=1}^T y_{t−1} Δy_t / Σ_{t=1}^T y_{t−1}² is the least squares estimator,

P(ρ̂ = 0) = P( ρ̂_I² Σ_{t=1}^T y_{t−1}² ≤ λ_T )
= P( [Σ_{t=1}^T y_{t−1} Δy_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T )
= P( [ρ Σ_{t=1}^T y_{t−1}² + Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T )
= P( ρ² Σ_{t=1}^T y_{t−1}² + 2ρ Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ).

Proof of Theorem 4. From Phillips (1987a) one has

(27)   ( (1/T) Σ_{t=1}^T y_{t−1} ε_t , (1/T²) Σ_{t=1}^T y_{t−1}² ) ⇒ ( σ² ∫₀¹ W(s) dW(s) , σ² ∫₀¹ W(s)² ds ).

Using Theorem 3 with ρ = 0 yields

P(ρ̂ = 0) = P( [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ).

From (27) and the continuous mapping theorem it follows that

(28)   G_T := [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² = [ (1/T) Σ_{t=1}^T y_{t−1} ε_t ]² / ( (1/T²) Σ_{t=1}^T y_{t−1}² ) ⇒ σ² [ ∫₀¹ W(s) dW(s) ]² / ∫₀¹ W(s)² ds := G,

where the last definition means that G is a random variable distributed as σ² [∫₀¹ W(s) dW(s)]² / ∫₀¹ W(s)² ds.

Case 1: λ_T → 0. Since the right hand side of (28) is absolutely continuous with respect to the Lebesgue measure, it has no mass points, and so P(ρ̂ = 0) = P(G_T ≤ λ_T) → F_G(0) = 0 if λ_T → 0.

Case 2: λ_T → λ ∈ (0, ∞). By the same reasoning as in Case 1 it follows that P(ρ̂ = 0) = P(G_T ≤ λ_T) → F_G(λ) =: p ∈ (0, 1), since G is supported on all of R₊.

Case 3: λ_T → ∞. Since G_T converges weakly it is tight and the result follows.

Proof of Theorem 5. By standard results, see e.g. Hamilton (1994),

(1/√T) Σ_{t=1}^T y_{t−1} ε_t ⇒ N(0, σ²E[y_t²])   and   (1/T) Σ_{t=1}^T y_{t−1}² →_p E[y_t²].

Using Theorem 3 with ρ ∈ (−2, 0) yields

P(ρ̂ = 0) = P( (1/T)( ρ² Σ_{t=1}^T y_{t−1}² + 2ρ Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ) ≤ λ_T/T ) = P( H_T ≤ λ_T/T ),

where H_T := ρ² (1/T) Σ_{t=1}^T y_{t−1}² + (2ρ/√T) (1/√T) Σ_{t=1}^T y_{t−1} ε_t + (1/T) [ (1/√T) Σ_{t=1}^T y_{t−1} ε_t ]² / ( (1/T) Σ_{t=1}^T y_{t−1}² ). Since (1/√T) Σ_{t=1}^T y_{t−1} ε_t is tight and (1/T) Σ_{t=1}^T y_{t−1}² →_p E[y_t²], the last two terms of H_T converge to zero in probability, and hence

H_T →_p ρ² E[y_t²] := L.

Case 1: λ_T/T → 0. Since H_T converges in probability to L > 0 it follows that

P(ρ̂ = 0) = P(H_T ≤ λ_T/T) ≤ P( |H_T − L| ≥ L/2 ) → 0,

where the estimate holds for T sufficiently large since λ_T/T → 0.

Case 2: λ_T/T → λ ∈ (0, ∞).

P(ρ̂ = 0) = P(H_T ≤ λ_T/T) ≤ P( H_T ≤ λ + (L − λ)/2 ) ≤ P( |H_T − L| ≥ (L − λ)/4 ) → 0,   if λ < L,
P(ρ̂ = 0) = P(H_T ≤ λ_T/T) ≥ P( H_T ≤ λ − (λ − L)/2 ) ≥ P( |H_T − L| ≤ (λ − L)/4 ) → 1,   if λ > L,

where the first estimate in each of the two cases holds from a certain step T and onwards.

Case 3: λ_T/T → ∞. Since H_T converges in probability it is tight and the result follows.

Proof of Theorem 6. By Phillips (1987b),

(29)   ( (1/T) Σ_{t=1}^T y_{t−1} ε_t , (1/T²) Σ_{t=1}^T y_{t−1}² ) ⇒ ( σ² ∫₀¹ J_c(r) dW(r) , σ² ∫₀¹ J_c(r)² dr ),

where J_c is the Ornstein-Uhlenbeck process with parameter c. Notice how the only difference to (27) is that the integrand process now is J_c(r) instead of W(r). For c = 0 they are identical, as the Ornstein-Uhlenbeck process collapses to the Wiener process. From Theorem 3 with ρ = c/T it follows that

P(ρ̂ = 0) = P( (c/T)² Σ_{t=1}^T y_{t−1}² + 2(c/T) Σ_{t=1}^T y_{t−1} ε_t + [Σ_{t=1}^T y_{t−1} ε_t]² / Σ_{t=1}^T y_{t−1}² ≤ λ_T ) = P( K_T ≤ λ_T ),

where K_T := c² (1/T²) Σ_{t=1}^T y_{t−1}² + 2c (1/T) Σ_{t=1}^T y_{t−1} ε_t + [ (1/T) Σ_{t=1}^T y_{t−1} ε_t ]² / ( (1/T²) Σ_{t=1}^T y_{t−1}² ). By the continuous mapping theorem,

K_T ⇒ c² σ² ∫₀¹ J_c(r)² dr + 2c σ² ∫₀¹ J_c(r) dW(r) + σ² [ ∫₀¹ J_c(r) dW(r) ]² / ∫₀¹ J_c(r)² dr := K,

where the last definition means that K is a random variable distributed as the weak limit of K_T.

Case 1: λ_T → 0. Since K is absolutely continuous with respect to the Lebesgue measure, it has no mass points, and so P(ρ̂ = 0) = P(K_T ≤ λ_T) → F_K(0) = 0.

Case 2: λ_T → λ ∈ (0, ∞). By the same reasoning as in Case 1 it follows that P(ρ̂ = 0) = P(K_T ≤ λ_T) → F_K(λ) ∈ (0, 1), since K is supported on all of R₊.

Case 3: λ_T → ∞. Since K_T converges weakly it is tight and the result follows.

Proof of Theorem 7. The setting is the same as in the proof of Theorem 1. Follow the proof of that theorem, with identical notation, up to (7), with γ₁ = γ₂ = 1. Next, notice that

(30)   λ_T w_1 |u₁|/T = λ_T |u₁| / |T ρ̂_I| ⇒ λ |u₁|/|C₁|

by (5) and (6), since C₁ has no mass at 0. Furthermore, if β_j ≠ 0,

(31)   λ_T w_{2,j} ( |β_j + u_{2,j}/√T| − |β_j| ) = (λ_T/√T) (1/|β̂_{I,j}|) √T( |β_j + u_{2,j}/√T| − |β_j| ) → 0 in probability

since (i) λ_T/√T → 0, (ii) 1/|β̂_{I,j}| → 1/|β_j| < ∞ in probability, and (iii) √T( |β_j + u_{2,j}/√T| − |β_j| ) → u_{2,j} sign(β_j). Finally, if β_j = 0,

(32)   λ_T w_{2,j} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / |√T β̂_{I,j}| ⇒ λ |u_{2,j}|/|C_j|

by (5) and (6), since (i) λ_T → λ and (ii) C_j is 0 with probability 0, such that x ↦ 1/|x| is continuous almost everywhere with respect to the limiting measure. Putting together (7) and (30)–(32) one concludes

V_T(u) ⇒ u'Au − 2u'B + λ |u₁|/|C₁| + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0} := Ψ(u).

Hence, since V_T(u) is convex and Ψ(u) has a unique minimum, it follows from Knight (1999) that arg min V_T(u) ⇒ arg min Ψ(u).

Proof of Theorem 8. The setting is the same as in the proof of Theorem 2. Follow the proof of that theorem, with identical notation, up to (22), with γ₁ = γ₂ = 1. For the case β_j = 0 one has

(33)   λ_T w_{2,j} ( |β_j + u_{2,j}/√T| − |β_j| ) = λ_T |u_{2,j}| / |√T β̂_{I,j}| ⇒ λ |u_{2,j}|/|C_j|

by (17) and (18), since (i) λ_T → λ and (ii) C_j is 0 with probability 0, such that x ↦ 1/|x| is continuous almost everywhere with respect to the limiting measure. Putting together (19)–(22) and (33) one concludes

Ṽ_T(u) ⇒ u'Qu − 2u'B + λ Σ_{j=1}^p (|u_{2,j}|/|C_j|) 1{β_j = 0} := Ψ̃(u).

Hence, since Ṽ_T(u) is convex and Ψ̃(u) has a unique minimum, it follows from Knight (1999) that arg min Ṽ_T(u) ⇒ arg min Ψ̃(u).

References

Bühlmann, P. and S. Van De Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer-Verlag, New York.
Candes, E. and T. Tao (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35, 2313-2351.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32, 407-499.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.
Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 849-911.
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press, Princeton.
Huang, J., J. L. Horowitz, and S. Ma (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36, 587-613.
Knight, K. (1999). Epi-convergence in distribution and stochastic equi-semicontinuity. Unpublished manuscript.
Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28, 1356-1378.
Knight, K. and W. Fu (2011). An alternative to unit root tests: Bridge estimators differentiate between nonstationary versus stationary models and select optimal lag. Working paper.
Kock, A. B. (2012). Oracle efficient variable selection in random and fixed effects panel data models. Econometric Theory, forthcoming.
Leeb, H. and B. M. Pötscher (2005). Model selection and inference: Facts and fiction. Econometric Theory 21, 21-59.
Leeb, H. and B. M. Pötscher (2008). Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics 142, 201-211.
Lehmann, E. L. and G. Casella (1998). Theory of Point Estimation. Springer-Verlag, New York.
Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34, 1436-1462.
Ng, S. and P. Perron (2001). Lag length selection and the construction of unit root tests with good size and power. Econometrica 69, 1519-1554.
Phillips, P. C. B. (1987a). Time series regression with a unit root. Econometrica 55, 277-301.
Phillips, P. C. B. (1987b). Towards a unified asymptotic theory for autoregression. Biometrika 74, 535-547.
Pötscher, B. M. and H. Leeb (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. Journal of Multivariate Analysis 100, 2065-2082.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58, 267-288.
Wang, H., G. Li, and C.-L. Tsai (2007). Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 63-78.

Zhao, P. and B. Yu (2006). On model selection consistency of lasso. The Journal of Machine Learning Research 7, 2541-2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.


More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Using all observations when forecasting under structural breaks

Using all observations when forecasting under structural breaks Using all observations when forecasting under structural breaks Stanislav Anatolyev New Economic School Victor Kitov Moscow State University December 2007 Abstract We extend the idea of the trade-off window

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Moreover, the second term is derived from: 1 T ) 2 1

Moreover, the second term is derived from: 1 T ) 2 1 170 Moreover, the second term is derived from: 1 T T ɛt 2 σ 2 ɛ. Therefore, 1 σ 2 ɛt T y t 1 ɛ t = 1 2 ( yt σ T ) 2 1 2σ 2 ɛ 1 T T ɛt 2 1 2 (χ2 (1) 1). (b) Next, consider y 2 t 1. T E y 2 t 1 T T = E(y

More information

Variable Selection in Predictive Regressions

Variable Selection in Predictive Regressions Variable Selection in Predictive Regressions Alessandro Stringhi Advanced Financial Econometrics III Winter/Spring 2018 Overview This chapter considers linear models for explaining a scalar variable when

More information

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED by W. Robert Reed Department of Economics and Finance University of Canterbury, New Zealand Email: bob.reed@canterbury.ac.nz

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

A Test of Cointegration Rank Based Title Component Analysis.

A Test of Cointegration Rank Based Title Component Analysis. A Test of Cointegration Rank Based Title Component Analysis Author(s) Chigira, Hiroaki Citation Issue 2006-01 Date Type Technical Report Text Version publisher URL http://hdl.handle.net/10086/13683 Right

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

ECONOMETRICS II, FALL Testing for Unit Roots.

ECONOMETRICS II, FALL Testing for Unit Roots. ECONOMETRICS II, FALL 216 Testing for Unit Roots. In the statistical literature it has long been known that unit root processes behave differently from stable processes. For example in the scalar AR(1)

More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

Inflation Revisited: New Evidence from Modified Unit Root Tests

Inflation Revisited: New Evidence from Modified Unit Root Tests 1 Inflation Revisited: New Evidence from Modified Unit Root Tests Walter Enders and Yu Liu * University of Alabama in Tuscaloosa and University of Texas at El Paso Abstract: We propose a simple modification

More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Inference for High Dimensional Robust Regression

Inference for High Dimensional Robust Regression Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:

More information

Keywords: sparse models, shrinkage, LASSO, adalasso, time series, forecasting, GARCH.

Keywords: sparse models, shrinkage, LASSO, adalasso, time series, forecasting, GARCH. l 1 -REGULARIZAION OF HIGH-DIMENSIONAL IME-SERIES MODELS WIH FLEXIBLE INNOVAIONS Marcelo C. Medeiros Department of Economics Pontifical Catholic University of Rio de Janeiro Rua Marquês de São Vicente

More information

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 11 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 30 Recommended Reading For the today Advanced Time Series Topics Selected topics

More information

Efficiency Tradeoffs in Estimating the Linear Trend Plus Noise Model. Abstract

Efficiency Tradeoffs in Estimating the Linear Trend Plus Noise Model. Abstract Efficiency radeoffs in Estimating the Linear rend Plus Noise Model Barry Falk Department of Economics, Iowa State University Anindya Roy University of Maryland Baltimore County Abstract his paper presents

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

arxiv: v2 [math.st] 29 Jan 2018

arxiv: v2 [math.st] 29 Jan 2018 On the Distribution, Model Selection Properties and Uniqueness of the Lasso Estimator in Low and High Dimensions Ulrike Schneider and Karl Ewald Vienna University of Technology arxiv:1708.09608v2 [math.st]

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression Simulating Properties of the Likelihood Ratio est for a Unit Root in an Explosive Second Order Autoregression Bent Nielsen Nuffield College, University of Oxford J James Reade St Cross College, University

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

The Role of "Leads" in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji

The Role of Leads in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji he Role of "Leads" in the Dynamic itle of Cointegrating Regression Models Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji Citation Issue 2006-12 Date ype echnical Report ext Version publisher URL http://hdl.handle.net/10086/13599

More information

Long memory and changing persistence

Long memory and changing persistence Long memory and changing persistence Robinson Kruse and Philipp Sibbertsen August 010 Abstract We study the empirical behaviour of semi-parametric log-periodogram estimation for long memory models when

More information

Nonstationary Time Series:

Nonstationary Time Series: Nonstationary Time Series: Unit Roots Egon Zakrajšek Division of Monetary Affairs Federal Reserve Board Summer School in Financial Mathematics Faculty of Mathematics & Physics University of Ljubljana September

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND

DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND Testing For Unit Roots With Cointegrated Data NOTE: This paper is a revision of

More information

Unit Root and Cointegration

Unit Root and Cointegration Unit Root and Cointegration Carlos Hurtado Department of Economics University of Illinois at Urbana-Champaign hrtdmrt@illinois.edu Oct 7th, 016 C. Hurtado (UIUC - Economics) Applied Econometrics On the

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

LASSO-type penalties for covariate selection and forecasting in time series

LASSO-type penalties for covariate selection and forecasting in time series LASSO-type penalties for covariate selection and forecasting in time series Evandro Konzen 1 Flavio A. Ziegelmann 2 Abstract This paper studies some forms of LASSO-type penalties in time series to reduce

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

The Number of Bootstrap Replicates in Bootstrap Dickey-Fuller Unit Root Tests

The Number of Bootstrap Replicates in Bootstrap Dickey-Fuller Unit Root Tests Working Paper 2013:8 Department of Statistics The Number of Bootstrap Replicates in Bootstrap Dickey-Fuller Unit Root Tests Jianxin Wei Working Paper 2013:8 June 2013 Department of Statistics Uppsala

More information

Testing for Unit Roots with Cointegrated Data

Testing for Unit Roots with Cointegrated Data Discussion Paper No. 2015-57 August 19, 2015 http://www.economics-ejournal.org/economics/discussionpapers/2015-57 Testing for Unit Roots with Cointegrated Data W. Robert Reed Abstract This paper demonstrates

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Delta Theorem in the Age of High Dimensions

Delta Theorem in the Age of High Dimensions Delta Theorem in the Age of High Dimensions Mehmet Caner Department of Economics Ohio State University December 15, 2016 Abstract We provide a new version of delta theorem, that takes into account of high

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Consider the trend-cycle decomposition of a time series y t

Consider the trend-cycle decomposition of a time series y t 1 Unit Root Tests Consider the trend-cycle decomposition of a time series y t y t = TD t + TS t + C t = TD t + Z t The basic issue in unit root testing is to determine if TS t = 0. Two classes of tests,

More information

The Spurious Regression of Fractionally Integrated Processes

The Spurious Regression of Fractionally Integrated Processes he Spurious Regression of Fractionally Integrated Processes Wen-Jen say Institute of Economics Academia Sinica Ching-Fan Chung Institute of Economics Academia Sinica January, 999 Abstract his paper extends

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Lecture 14: Shrinkage

Lecture 14: Shrinkage Lecture 14: Shrinkage Reading: Section 6.2 STATS 202: Data mining and analysis October 27, 2017 1 / 19 Shrinkage methods The idea is to perform a linear regression, while regularizing or shrinking the

More information

Cointegration and the joint con rmation hypothesis

Cointegration and the joint con rmation hypothesis Cointegration and the joint con rmation hypothesis VASCO J. GABRIEL Department of Economics, Birkbeck College, UK University of Minho, Portugal October 2001 Abstract Recent papers by Charemza and Syczewska

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY PREFACE xiii 1 Difference Equations 1.1. First-Order Difference Equations 1 1.2. pth-order Difference Equations 7

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

A Gaussian IV estimator of cointegrating relations Gunnar Bårdsen and Niels Haldrup. 13 February 2006.

A Gaussian IV estimator of cointegrating relations Gunnar Bårdsen and Niels Haldrup. 13 February 2006. A Gaussian IV estimator of cointegrating relations Gunnar Bårdsen and Niels Haldrup 3 February 26. Abstract. In static single equation cointegration regression models the OLS estimator will have a non-standard

More information

The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso)

The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso) Electronic Journal of Statistics Vol. 0 (2010) ISSN: 1935-7524 The adaptive the thresholded Lasso for potentially misspecified models ( a lower bound for the Lasso) Sara van de Geer Peter Bühlmann Seminar

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich Submitted to the Annals of Statistics DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich We congratulate Richard Lockhart, Jonathan Taylor, Ryan

More information

Very preliminary, please do not cite without permission. ESTIMATION AND FORECASTING LARGE REALIZED COVARIANCE MATRICES LAURENT A. F.

Very preliminary, please do not cite without permission. ESTIMATION AND FORECASTING LARGE REALIZED COVARIANCE MATRICES LAURENT A. F. Very preliminary, please do not cite without permission. ESTIMATION AND FORECASTING LARGE REALIZED COVARIANCE MATRICES LAURENT A. F. CALLOT VU University Amsterdam, The Netherlands, CREATES, and the Tinbergen

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

ECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes

ECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes ECON 616: Lecture Two: Deterministic Trends, Nonstationary Processes ED HERBST September 11, 2017 Background Hamilton, chapters 15-16 Trends vs Cycles A commond decomposition of macroeconomic time series

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable. Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have

More information

On the Correlations of Trend-Cycle Errors

On the Correlations of Trend-Cycle Errors On the Correlations of Trend-Cycle Errors Tatsuma Wada Wayne State University This version: December 19, 11 Abstract This note provides explanations for an unexpected result, namely, the estimated parameter

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Nonsense Regressions due to Neglected Time-varying Means

Nonsense Regressions due to Neglected Time-varying Means Nonsense Regressions due to Neglected Time-varying Means Uwe Hassler Free University of Berlin Institute of Statistics and Econometrics Boltzmannstr. 20 D-14195 Berlin Germany email: uwe@wiwiss.fu-berlin.de

More information

arxiv: v2 [math.st] 12 Feb 2008

arxiv: v2 [math.st] 12 Feb 2008 arxiv:080.460v2 [math.st] 2 Feb 2008 Electronic Journal of Statistics Vol. 2 2008 90 02 ISSN: 935-7524 DOI: 0.24/08-EJS77 Sup-norm convergence rate and sign concentration property of Lasso and Dantzig

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs Questions and Answers on Unit Roots, Cointegration, VARs and VECMs L. Magee Winter, 2012 1. Let ɛ t, t = 1,..., T be a series of independent draws from a N[0,1] distribution. Let w t, t = 1,..., T, be

More information

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University Topic 4 Unit Roots Gerald P. Dwyer Clemson University February 2016 Outline 1 Unit Roots Introduction Trend and Difference Stationary Autocorrelations of Series That Have Deterministic or Stochastic Trends

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information