High-Dimensional Sparse Econometric Models, an Introduction


1 High-Dimensional Sparse Econometric Models, an Introduction ICE, July 2011

2 Outline
1. High-Dimensional Sparse Econometric Models (HDSM): models; motivating examples for linear/nonparametric regression.
2. Estimation methods for linear/nonparametric regression: $\ell_1$-penalized estimators (formulations, penalty parameter, computational methods); post-$\ell_1$-penalized estimators; summary of rates of convergence.
3. Model selection applications in econometrics: results for IV estimators; results for quantile regression; results for variable length Markov chains.

3 High-Dimensional Sparse Models. We have a large number of parameters p, potentially larger than the sample size n. The idea is that a low-dimensional submodel captures very accurately all essential features of the full-dimensional model. The key assumption is that the number of relevant parameters is smaller than the sample size; fundamentally, this is a model selection problem. Three questions arise: Do sparse models make sense? How can we compute such models? What guarantees can we establish?

4 1. High-Dimensional Sparse Econometric Model (HDSM). A response variable $y_i$ obeys
$$y_i = x_i'\beta_0 + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2), \quad i = 1,\ldots,n,$$
where the $x_i$ are fixed $p$-dimensional regressors, $x_i = (x_{ij},\ j = 1,\ldots,p)'$, normalized so that $E_n[x_{ij}^2] = 1$, and $p$ is possibly much larger than $n$. The key assumption is sparsity:
$$s := \|\beta_0\|_0 = \sum_{j=1}^{p} 1\{\beta_{0j} \neq 0\} \ll n.$$
This generalizes the standard large parametric econometric model by letting the identity $T := \mathrm{support}(\beta_0)$ of the $s$ relevant regressors be unknown.

5 High p motivation:
- transformations of basic regressors $z_i$, i.e., $x_i = (P_1(z_i),\ldots,P_p(z_i))'$; for example, in wage regressions the $P_j$'s are polynomials or B-splines in education and experience;
- or simply a very large list of regressors: a list of country characteristics in cross-country growth regressions (Barro & Lee); housing characteristics in hedonic regressions (American Housing Survey); price and product characteristics at the point of purchase (scanner data, TNS).

6 Sparseness. The key assumption is that the number of non-zero regression coefficients is smaller than the sample size:
$$s := \|\beta_0\|_0 = \sum_{j=1}^{p} 1\{\beta_{0j} \neq 0\} \ll n.$$
The idea is that a low, $s$-dimensional submodel accurately approximates the full $p$-dimensional model; here the approximation error is in fact zero. We can easily relax this assumption and allow a non-zero approximation error, and we can substantially relax the Gaussianity of the error terms $\epsilon_i$ using results from self-normalized moderate deviation theory.

7 High-Dimensional Sparse Econometric Model (HDSM). For $y_i$ the response variable and $z_i$ the fixed elementary regressors,
$$y_i = g(z_i) + \epsilon_i, \quad E[\epsilon_i] = 0, \quad E[\epsilon_i^2] = \sigma^2, \quad i = 1,\ldots,n.$$
Consider a $p$-dimensional dictionary of series terms, $x_i = (P_1(z_i),\ldots,P_p(z_i))'$, for example B-splines, polynomials, and interactions, where $p$ is possibly much larger than $n$. Expand via generalized series:
$$g(z_i) = x_i'\beta_0 + r_i,$$
where $x_i'\beta_0$ is the best $s$-dimensional parametric approximation,
$$s := \|\beta_0\|_0 = \sum_{j=1}^{p} 1\{\beta_{0j} \neq 0\} \ll n,$$
and $r_i$ is the approximation error.

8 High-Dimensional Sparse Econometric Model (HDSM). As a result we have an approximately sparse model:
$$y_i = x_i'\beta_0 + r_i + \epsilon_i, \quad i = 1,\ldots,n.$$
The parametric part $x_i'\beta_0$ is the target, with $s$ non-zero coefficients. We choose $s$ to balance the approximation error against the oracle estimation error:
$$E_n[r_i^2] \lesssim \frac{\sigma^2 s}{n}.$$
This model generalizes the exact sparse model by allowing approximation error and non-Gaussian errors. It also generalizes the standard series regression model by letting the identity $T = \mathrm{support}(\beta_0)$ of the most important $s$ series terms be unknown.

9 Example 1: Series Models of Wage Functions. In this example, we abstract away from estimation questions. Consider a series expansion of the conditional expectation $E[y_i \mid z_i]$ of wage $y_i$ given education $z_i$. A conventional series approximation to the regression function is, for example,
$$E[y_i \mid z_i] = \beta_1 P_1(z_i) + \beta_2 P_2(z_i) + \beta_3 P_3(z_i) + \beta_4 P_4(z_i) + r_i,$$
where $P_1,\ldots,P_4$ are low-order polynomials or splines with a few knots.

10 Traditional Approximation of the Expected Wage Function using Polynomials. [Figure: wage versus education.] In the figure, the true regression function $E[y_i \mid z_i]$ is computed using U.S. Census data.

11 With the same number of parameters, can we find a much better series approximation? For this to be possible, some high-order series terms in the expansion
$$E[y_i \mid z_i] = \sum_{k=1}^{p} \beta_{0k} P_k(z_i) + r_i$$
must have large coefficients. This indeed happens in our example, since $E[y_i \mid z_i]$ exhibits sharp oscillatory behavior in some regions, which low-degree polynomials fail to capture. We find the right series terms by using $\ell_1$-penalization methods.

12 Lasso Approximation of the Expected Wage Function using Polynomials and Dummies. [Figure: wage versus education.]

13 Traditional vs. Lasso Approximation of Expected Wage Functions with an Equal Number of Parameters. [Figure: wage versus education, both fits overlaid.]

14 Errors of the Conventional and the Lasso-Based Sparse Approximations. [Table: $L_2$ and $L_\infty$ errors of the conventional series approximation versus the lasso-based series approximation.]

15 Example 2: Cross-Country Growth Regression. Relation between the growth rate and initial per capita GDP:
$$\text{GrowthRate} = \alpha_0 + \alpha_1 \log(\text{GDP}) + \epsilon.$$
Test the convergence hypothesis, $\alpha_1 < 0$: poor countries catch up with richer countries. This is the prediction of the classical Solow growth model. Linear regression on $\log(\text{GDP})$ alone yields a fitted equation of the form
$$\text{GrowthRate} = \hat\alpha_0 + \hat\alpha_1 \log(\text{GDP}) + \hat\epsilon.$$
Test the convergence hypothesis conditional on covariates describing institutions and technological factors:
$$\text{GrowthRate} = \alpha_0 + \alpha_1 \log(\text{GDP}) + \sum_{j=1}^{p} \beta_j X_j + \epsilon.$$

16 Example 2: Cross-Country Growth Regression. In the Barro–Lee data, we have p = 60 covariates and n = 90 observations, and we need to select the significant covariates among the 60. Real GDP per capita (log) is included in all models. As the penalization parameter $\lambda$ decreases, the additional selected variables are:
- Black Market Premium (log);
- Black Market Premium (log); Political Instability;
- Black Market Premium (log); Political Instability; Ratio of nominal government expenditure on defense to nominal GDP; Ratio of imports to GDP.

17 Example 2: Cross-Country Growth Regression. We find a sparse model for the conditioning effects using $\ell_1$-penalization, and we re-fit the resulting model using least squares. We find that the coefficient on lagged GDP is negative, and the confidence intervals, adjusted for model selection, exclude zero:

Coefficient | Confidence set
lagged GDP | [ , ]

Our findings fully support the conclusions reached in Barro and Lee and in Barro and Sala-i-Martin.

18 2. Estimation via $\ell_1$-penalty methods. The canonical least-squares estimator, which minimizes
$$\hat Q(\beta) = E_n[(y_i - x_i'\beta)^2],$$
is not consistent when $\dim(\beta) = p \geq n$. We can try minimizing an AIC/BIC-type criterion function
$$\hat Q(\beta) + \lambda \|\beta\|_0, \quad \|\beta\|_0 = \sum_{j=1}^{p} 1\{\beta_j \neq 0\},$$
but this is not computationally feasible when $p$ is large: it is essentially a search over too many (non-nested) models; technically, it is a combinatorial problem known to be NP-hard.

19 2. Estimation via $\ell_1$-penalty methods. One must balance and understand simultaneously the computational issues (a hard combinatorial search) and the statistical issues (lack of global identification). Understanding that statistical aspects can help computational methods, and vice versa, was crucial. Further developments in model selection estimation will rely heavily on computational advances. (MCMC is an example of a computational method with a profound impact on the statistics/econometrics literature.)

20 2. Estimation via $\ell_1$-penalty methods. Goal: avoid computing $\hat Q(\beta) + \lambda\|\beta\|_0$, but still get something meaningful. A solution (Tibshirani, JRSS B, 1996) is to replace the $\ell_0$-"norm" by the closest convex function, the $\ell_1$-norm:
$$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|.$$
The lasso estimator $\hat\beta$ then minimizes $\hat Q(\beta) + \lambda\|\beta\|_1$. This change makes the regularization function convex, so good computational methods are available for large-scale problems, and there are many ways to combine the fit and regularization functions.
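
To fix ideas, here is a minimal sketch of the lasso in Python using scikit-learn — an implementation choice made here for illustration, not the software behind these slides. Note that scikit-learn's Lasso minimizes $\frac{1}{2n}\|y - X\beta\|^2 + \alpha\|\beta\|_1$, so under the normalization $\hat Q(\beta) = E_n[(y_i - x_i'\beta)^2]$ used above one sets $\alpha = \lambda/2$; the penalty level below anticipates the choice discussed later.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated sparse design with p > n: only the first 5 coefficients are non-zero.
rng = np.random.default_rng(0)
n, p, s = 100, 500, 5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = 1.0
y = X @ beta0 + rng.standard_normal(n)

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
# so alpha = lambda/2 under the normalization Q(b) = E_n[(y_i - x_i'b)^2].
sigma = 1.0  # treated as known here, purely for illustration
lam = 2 * sigma * np.sqrt(2 * np.log(p) / n)
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)
print("selected regressors:", np.flatnonzero(fit.coef_))
```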

21 Formulations. Idea: balance the fit $\hat Q(\beta)$ and the penalty $\|\beta\|_1$ using a parameter $\lambda$. There are several ways to do it:
$$\min_\beta\ \hat Q(\beta) + \lambda\|\beta\|_1 \quad \text{(lasso)}$$
$$\min_\beta\ \hat Q(\beta)\ :\ \|\beta\|_1 \leq \lambda \quad \text{(sieve)}$$
$$\min_\beta\ \|\beta\|_1\ :\ \hat Q(\beta) \leq \lambda \quad (\ell_1\text{-minimization})$$
$$\min_\beta\ \|\beta\|_1\ :\ \|\nabla\hat Q(\beta)\|_\infty \leq \lambda \quad \text{(Dantzig selector)}$$
$$\min_\beta\ \sqrt{\hat Q(\beta)} + \lambda\|\beta\|_1 \quad (\sqrt{\text{lasso}})$$

22 Penalty parameter $\lambda$. The motivation/heuristic is to choose the parameter $\lambda$ so that $\beta_0$ is feasible for the problem, but not too many more points are:
- sieve: ideally set $\lambda = \|\beta_0\|_1$;
- $\ell_1$-minimization: ideally set $\lambda = \sigma^2$;
- Dantzig selector: ideally set $\lambda = \|\nabla \hat Q(\beta_0)\|_\infty = \|2E_n[\epsilon_i x_i]\|_\infty$.
Note that lasso and $\sqrt{\text{lasso}}$ are unrestricted minimizations. Their penalties are chosen to dominate the score at the true value, respectively $\lambda > \|\nabla \hat Q(\beta_0)\|_\infty$ and $\lambda > \|\nabla \sqrt{\hat Q}(\beta_0)\|_\infty$.

23 Heuristics for lasso via Convex Geometry. [Figure: the objective $\hat Q(\beta)$ along a direction $\beta_{T^c}$ away from the true value ($\beta_{0T^c} = 0$).]

24 Heuristics for lasso via Convex Geometry. [Figure: the linearization $\hat Q(\beta_0) + \nabla\hat Q(\beta_0)'\beta_{T^c}$ added to the previous plot.]

25 Heuristics for lasso via Convex Geometry. [Figure: the penalty $\lambda\|\beta\|_1$ with $\lambda = \|\nabla\hat Q(\beta_0)\|_\infty$, just matching the score at the true value.]

26 Heuristics for lasso via Convex Geometry. [Figure: the penalty $\lambda\|\beta\|_1$ with $\lambda > \|\nabla\hat Q(\beta_0)\|_\infty$, dominating the score at the true value.]

27 Heuristics via Convex Geometry. [Figure: the penalized objective $\hat Q(\beta) + \lambda\|\beta\|_1$ with $\lambda > \|\nabla\hat Q(\beta_0)\|_\infty$, minimized at the true value in the direction $\beta_{T^c}$.]

28 Penalty parameters for lasso and $\sqrt{\text{lasso}}$. In the case of lasso, the optimal choice of penalty level $\lambda$, guaranteeing near-oracle performance, is
$$\lambda = 2\sigma\sqrt{\frac{2\log p}{n}}\ \geq_P\ \|2E_n[\epsilon_i x_i]\|_\infty = \|\nabla\hat Q(\beta_0)\|_\infty.$$
This choice relies on knowing $\sigma$, which may be a priori hard to estimate when $p \gg n$. In the case of $\sqrt{\text{lasso}}$, the score is a pivotal random variable, with distribution independent of unknown parameters:
$$\|\nabla\sqrt{\hat Q}(\beta_0)\|_\infty = \frac{\|E_n[\epsilon_i x_i]\|_\infty}{\sqrt{E_n[\epsilon_i^2]}} = \frac{\|E_n[(\epsilon_i/\sigma)x_i]\|_\infty}{\sqrt{E_n[(\epsilon_i/\sigma)^2]}},$$
hence for $\sqrt{\text{lasso}}$ the penalty $\lambda$ does not depend on $\sigma$.
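
The contrast between the two penalty choices is easy to see numerically. The sketch below (our illustration, not the slides' code) computes the lasso level, which requires $\sigma$, and simulates the distribution of the pivotal $\sqrt{\text{lasso}}$ score to obtain a penalty that does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
X /= np.sqrt((X**2).mean(axis=0))    # enforce the normalization E_n[x_ij^2] = 1

# lasso: the penalty level needs sigma...
sigma = 1.0
lam_lasso = 2 * sigma * np.sqrt(2 * np.log(p) / n)

# ...whereas the sqrt-lasso score max_j |E_n[e_i x_ij]| / sqrt(E_n[e_i^2])
# is pivotal: simulate it with standard normal errors and take a high quantile.
R = 1000
scores = np.empty(R)
for r in range(R):
    e = rng.standard_normal(n)
    scores[r] = np.abs(X.T @ e / n).max() / np.sqrt((e**2).mean())
lam_sqrt_lasso = np.quantile(scores, 0.95)   # no sigma anywhere
print(lam_lasso, lam_sqrt_lasso)
```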

29 Computational Methods for lasso. Big picture of computational methods:

Method | Problem size | Computational complexity
Componentwise | extremely large | convergence guaranteed
First-order | very large | $pn \log p \cdot \varepsilon^{-1}$
Interior-point | large (up to 5000) | $(p+n)^{3.5} \log((p+n)/\varepsilon)$

Remark: several implementations of these methods are available: SDPT3 or SeDuMi for interior-point methods; TFOCS for first-order methods; several downloadable codes for componentwise methods (which are also easy to implement).

30 Computational Methods for lasso: Componentwise. The componentwise method is a common approach to multivariate optimization: (i) pick a component and fix all remaining components; (ii) improve the objective function along the chosen component; (iii) loop over steps (i)–(ii) until convergence is achieved. This approach has been widely used when the minimization over a single component can be done very efficiently, which is the case for lasso. Its simple implementation has also contributed to its widespread use.

31 Computational Methods for lasso: Componentwise. For a current iterate $\beta$, let $\beta_{-j} = (\beta_1,\ldots,\beta_{j-1},0,\beta_{j+1},\ldots,\beta_p)'$:
- if $2E_n[x_{ij}(y_i - x_i'\beta_{-j})] > \lambda$, the minimum over $\beta_j$ is $\beta_j = \big(2E_n[x_{ij}(y_i - x_i'\beta_{-j})] - \lambda\big)\big/\big(2E_n[x_{ij}^2]\big)$;
- if $2E_n[x_{ij}(y_i - x_i'\beta_{-j})] < -\lambda$, the minimum over $\beta_j$ is $\beta_j = \big(2E_n[x_{ij}(y_i - x_i'\beta_{-j})] + \lambda\big)\big/\big(2E_n[x_{ij}^2]\big)$;
- if $\big|2E_n[x_{ij}(y_i - x_i'\beta_{-j})]\big| \leq \lambda$, the minimum over $\beta_j$ is $\beta_j = 0$.
Remark: similar formulas can be derived for other methods, such as $\sqrt{\text{lasso}}$.
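
The update rules above translate directly into code. Below is a minimal, unoptimized Python sketch of the componentwise (coordinate descent) method; a serious implementation would cache residuals rather than recompute $X\beta$ for every coordinate.

```python
import numpy as np

def lasso_componentwise(X, y, lam, n_sweeps=200):
    """Componentwise lasso following the update rules above.
    Unoptimized sketch: recomputes the partial residual for every coordinate."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            beta[j] = 0.0                           # this makes beta equal beta_{-j}
            g = 2 * X[:, j] @ (y - X @ beta) / n    # 2 E_n[x_ij (y_i - x_i' beta_{-j})]
            d = 2 * (X[:, j] ** 2).mean()           # 2 E_n[x_ij^2]
            if g > lam:
                beta[j] = (g - lam) / d
            elif g < -lam:
                beta[j] = (g + lam) / d
            # otherwise |g| <= lam and beta[j] stays at zero
    return beta
```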

32 Computational Methods for lasso: first-order methods. The new generation of first-order methods focuses on structured convex problems that can be cast as
$$\min_w\ f(A(w) + b) + h(w) \quad \text{or} \quad \min_w\ h(w)\ :\ A(w) + b \in K,$$
where $f$ is a smooth function, $h$ is a structured function that is possibly non-differentiable or takes extended values, and $K$ is a structured convex set (typically a convex cone).

33 Computational Methods for lasso: first-order methods. Lasso is cast as $\min_w f(A(w)+b) + h(w)$ with
$$f(\cdot) = \frac{\|\cdot\|^2}{n}, \quad h(\cdot) = \lambda\|\cdot\|_1, \quad A = X, \quad b = -Y.$$
Thus,
$$\min_\beta\ \frac{\|X\beta - Y\|^2}{n} + \lambda\|\beta\|_1.$$
Remark: a number of different first-order methods have been proposed and analyzed; we briefly mention just one.

34 Computational Methods for lasso: first-order methods. On every iteration, for a given $\mu > 0$ and current point $\beta^k$, compute
$$\beta(\beta^k) = \arg\min_\beta\ -2E_n[x_i(y_i - x_i'\beta^k)]'\beta + \frac{\mu}{2}\|\beta - \beta^k\|^2 + \lambda\|\beta\|_1.$$
The minimization in $\beta$ above is separable and can be solved by soft-thresholding: letting
$$m_j^k := \beta_j^k + 2E_n[x_{ij}(y_i - x_i'\beta^k)]/\mu,$$
we update the iterate as
$$\beta_j(\beta^k) = \operatorname{sign}(m_j^k)\,\max\{|m_j^k| - \lambda/\mu,\ 0\}.$$
Remark: the additional regularization term $\frac{\mu}{2}\|\beta - \beta^k\|^2$ brings computational stability, which allows the algorithm to enjoy better computational complexity properties.
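
In code, the iteration is a gradient step followed by componentwise soft-thresholding (the ISTA scheme). The sketch below is our illustration of the formulas above; choosing $\mu$ as twice the largest eigenvalue of $X'X/n$ is one simple way to dominate the curvature of $\hat Q$, not the slides' prescription.

```python
import numpy as np

def lasso_ista(X, y, lam, mu=None, n_iter=500):
    """First-order sketch of the iteration above: a gradient step on
    Q(b) = E_n[(y_i - x_i'b)^2] followed by soft-thresholding at lam/mu."""
    n, p = X.shape
    if mu is None:
        # mu must dominate the curvature of Q; twice the largest
        # eigenvalue of X'X/n is one simple valid choice.
        mu = 2 * np.linalg.eigvalsh(X.T @ X / n).max()
    beta = np.zeros(p)
    for _ in range(n_iter):
        m = beta + 2 * X.T @ (y - X @ beta) / (n * mu)   # m_j^k in the slides
        beta = np.sign(m) * np.maximum(np.abs(m) - lam / mu, 0.0)
    return beta
```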

35 Computational Methods for lasso: interior-point methods. Interior-point method (IPM) solvers typically focus on solving conic programming problems in standard form,
$$\min_w\ c'w\ :\ Aw = b,\ w \in K. \tag{2.1}$$
The conic constraint will be binding at the optimal solution. These methods use a barrier function $F_K$ for the cone $K$ that is strictly convex, smooth, and satisfies $F_K(w) \to \infty$ as $w$ approaches the boundary of $K$, and for some $\mu > 0$ proceed to compute
$$w(\mu) = \arg\min_w\ c'w + \mu F_K(w)\ :\ Aw = b. \tag{2.2}$$
As $\mu \to 0$, $w(\mu)$ converges to the solution of (2.1). We can recover an approximate solution of (2.1) efficiently by using an approximate solution for $w(\mu_k)$ as a warm start to approximate $w(\mu_{k+1})$, $\mu_k > \mu_{k+1}$.

36 Computational Methods for lasso: interior-point methods. Not all barriers are created equal: solving the problems
$$w(\mu) = \arg\min_w\ c'w + \mu F_K(w)\ :\ Aw = b \tag{2.3}$$
is tractable only if $F_K$ is a special barrier (which depends on $K$). Three cones play a major role in optimization theory:
- the non-negative orthant: $\mathbb{R}^p_+ = \{w \in \mathbb{R}^p : w_j \geq 0,\ j = 1,\ldots,p\}$, with $F_K(w) = -\sum_{j=1}^{p}\log(w_j)$;
- the second-order cone: $Q^{n+1} = \{(w,t) \in \mathbb{R}^n \times \mathbb{R}_+ : t \geq \|w\|\}$, with $F_K(w,t) = -\log(t^2 - \|w\|^2)$;
- the semidefinite positive matrices: $S^{k\times k}_+ = \{W \in \mathbb{R}^{k\times k} : W = W',\ v'Wv \geq 0 \text{ for all } v \in \mathbb{R}^k\}$, with $F_K(W) = -\log\det(W)$.

37 Computational Methods for lasso: interior-point methods. These cones (and three others) are called self-scaled cones, and their Cartesian products are also self-scaled. Therefore, when using IPMs it is VERY important computationally to arrive at a formulation that is not only convex but also relies on self-scaled cones. It turns out that many tractable penalization methods (including $\ell_1$, but others as well) can be formulated with these three cones.

38 Computational Methods for lasso: interior-point methods. Lasso: $\min_\beta \hat Q(\beta) + \lambda\|\beta\|_1$.
- To formulate the $\ell_1$-norm, let $\beta = \beta^+ - \beta^-$, where $\beta^+, \beta^- \in \mathbb{R}^p_+$.
- For any vector $v \in \mathbb{R}^n$ and any scalar $t \geq 0$, we have that $v'v \leq t$ is equivalent to $\|(v, (t-1)/2)\| \leq (t+1)/2$; set $u := (t-1)/2$ and $z := (t+1)/2$.
This yields the conic program
$$\min_{t,\beta^+,\beta^-,u,z,v}\ \frac{t}{n} + \lambda\sum_{j=1}^{p}(\beta^+_j + \beta^-_j)$$
subject to
$$v = Y - X\beta^+ + X\beta^-, \quad t = 1 + 2u, \quad t = -1 + 2z,$$
$$(v,u,z) \in Q^{n+2}, \quad t \geq 0, \quad \beta^+ \in \mathbb{R}^p_+, \quad \beta^- \in \mathbb{R}^p_+.$$
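
In practice one rarely writes the conic form by hand: a modeling layer performs this reformulation automatically and passes the result to an interior-point solver. Here is a minimal sketch using the cvxpy package — our choice for illustration; the slides mention SDPT3 and SeDuMi:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)
lam = 2 * np.sqrt(2 * np.log(p) / n)

beta = cp.Variable(p)
# cvxpy rewrites this objective into a conic (second-order cone) program,
# much like the formulation above, and hands it to the solver.
objective = cp.Minimize(cp.sum_squares(X @ beta - y) / n + lam * cp.norm1(beta))
prob = cp.Problem(objective)
prob.solve(solver=cp.ECOS)   # ECOS: an interior-point SOCP solver
print("non-zero coefficients:", int(np.sum(np.abs(beta.value) > 1e-6)))
```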

39 Quick comparison on a very sparse model: $s = 5$, $\beta_{0j} = 1$ for $j = 1,\ldots,s$, $\beta_{0j} = 0$ for $j > s$. [Table: average running time over 100 simulations (with different precisions across methods) of the componentwise, first-order, and interior-point implementations of lasso and $\sqrt{\text{lasso}}$, for $(n,p) = (100, 500)$, $(200, 1000)$, and $(400, 2000)$.]

40 Post-Model-Selection Estimator. Define the post-lasso estimator as follows:
Step 1: Select the model using lasso, $\hat T := \mathrm{support}(\hat\beta)$.
Step 2: Apply ordinary least squares (OLS) to the selected model $\hat T$.
Remark: we could use $\sqrt{\text{lasso}}$, the Dantzig selector, sieve, or $\ell_1$-minimization instead of lasso in Step 1.
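
A minimal sketch of the two steps in Python, reusing scikit-learn as above (the $\alpha = \lambda/2$ reparametrization is explained earlier):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def post_lasso(X, y, lam):
    """Step 1: select T = support(beta_hat) by lasso.
    Step 2: refit the selected columns by OLS to undo the shrinkage bias."""
    n, p = X.shape
    step1 = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)
    T = np.flatnonzero(step1.coef_)
    beta = np.zeros(p)
    if T.size > 0:
        beta[T] = LinearRegression(fit_intercept=False).fit(X[:, T], y).coef_
    return beta, T
```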

41 Post-Model-Selection Estimator: Geometry. [Figure: in the $(\beta_1,\beta_2)$ plane, the estimator $\hat\beta$ from $\min\ \|\beta\|_1 : \hat Q(\beta) \leq \gamma$, $\beta \in \mathbb{R}^p$, together with the true value $\beta_0$.] In this example the correct model is selected, but note the shrinkage bias. The shrinkage bias can be removed by the post-model-selection estimator, i.e., by applying OLS to the selected model.

42 Post-Model-Selection Estimator: Geometry. [Figure: the same construction, with the estimator $\hat\beta$ landing off the axes.] In this example the correct model was not selected: the selected model contains a junk regressor, which can impact the post-model-selection estimator.

43 Issues for the post-model-selection estimator. Lasso and other $\ell_1$-penalized methods will successfully zero out many irrelevant regressors. However, these procedures are not perfect at model selection: they might include junk regressors or exclude some relevant ones, and they bias/shrink the non-zero coefficient estimates towards zero. Post-model-selection estimators are motivated by the need to remove this bias towards zero, but the results need to account for possible misspecification of the selected model.

44 Regularity Condition. A simple sufficient condition is as follows.
Condition RSE. Take any $C > 1$. Then, for all $n$ large enough, the $sC$-sparse eigenvalues of the empirical design matrix $E_n[x_i x_i']$ are bounded away from zero and from above; namely,
$$0 < \kappa' \leq \min_{\|\delta\|_0 \leq sC}\frac{E_n[(x_i'\delta)^2]}{\|\delta\|^2} \leq \max_{\|\delta\|_0 \leq sC}\frac{E_n[(x_i'\delta)^2]}{\|\delta\|^2} \leq \kappa'' < \infty. \tag{2.4}$$
This holds under i.i.d. sampling if $E[x_i x_i']$ has eigenvalues bounded away from zero and from above, and either: $x_i$ has light tails (e.g., log-concave) and $s\log p = o(n)$; or the regressors are bounded, $\max_{ij}|x_{ij}| \leq K$, and $s\log^4 p = o(n)$.

45 Result 1: Rates for $\ell_1$-penalized methods.
Theorem (Rates). Under Condition RSE, with probability approaching 1,
$$\sqrt{E_n[(x_i'\hat\beta - x_i'\beta_0)^2]}\ \lesssim\ \sigma\sqrt{\frac{s\log(n \vee p)}{n}}.$$
- The rate is very close to the oracle rate $\sqrt{s/n}$, obtainable when we know the true model $T$.
- $p$ shows up only through the logarithmic term $\log p$.
References: for the Dantzig selector, Candès and Tao (Annals of Statistics, 2007); for lasso, Meinshausen and Yu (Annals of Statistics, 2009) and Bickel, Ritov, and Tsybakov (Annals of Statistics, 2009); for $\sqrt{\text{lasso}}$, B., Chernozhukov, and Wang (Biometrika, 2011).

46 Result 2: Post-Model-Selection Estimator.
Theorem (Rate for the Post-Model-Selection Estimator). Under the conditions of Result 1, with probability approaching 1,
$$\sqrt{E_n[(x_i'\hat\beta_{PL} - x_i'\beta_0)^2]}\ \lesssim\ \sigma\sqrt{\frac{s\log(n \vee p)}{n}},$$
and in some exceptional cases the rate is faster, up to $\sigma\sqrt{s/n}$.
- Even though lasso does not in general perfectly select the relevant regressors, post-lasso performs at least as well.
- The analysis accounts for a selected model that can be misspecified.
This result was first derived in the context of median regression by B. and Chernozhukov (Annals of Statistics, 2011) and extended to least squares by B. and Chernozhukov (R&R at Bernoulli, 2011).

47 Monte Carlo. In this simulation we used $s = 6$, $p = 500$, $n = 100$:
$$y_i = x_i'\beta_0 + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2), \quad \beta_0 = (1,\ 1,\ 1/2,\ 1/3,\ 1/4,\ 1/5,\ 0,\ldots,0)',$$
$$x_i \sim N(0,\Sigma), \quad \Sigma_{ij} = (1/2)^{|i-j|}, \quad \sigma^2 = 1.$$
Ideal benchmark: the oracle estimator, which runs OLS of $y_i$ on $x_{i1},\ldots,x_{i6}$. This estimator is not feasible outside the Monte Carlo.
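
A sketch of this design in Python, for replicating the setup (not the exact figures):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
n, p = 100, 500
beta0 = np.zeros(p)
beta0[:6] = [1, 1, 1/2, 1/3, 1/4, 1/5]
Sigma = toeplitz(0.5 ** np.arange(p))          # Sigma_ij = (1/2)^|i-j|
L = np.linalg.cholesky(Sigma)

X = rng.standard_normal((n, p)) @ L.T          # rows x_i ~ N(0, Sigma)
y = X @ beta0 + rng.standard_normal(n)         # sigma^2 = 1

# Oracle benchmark (infeasible outside the simulation): OLS on the true support.
beta_oracle = np.zeros(p)
beta_oracle[:6] = np.linalg.lstsq(X[:, :6], y, rcond=None)[0]
```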

48 Monte Carlo Results: Risk. [Figure: estimation risk $E[(x_i'\hat\beta - x_i'\beta_0)^2]$ of LASSO, Post-LASSO, and the Oracle; $n = 100$, $p = 500$.]

49 Monte Carlo Results: Bias. [Figure: bias $\|E[\hat\beta] - \beta_0\|_2$ of LASSO and Post-LASSO; $n = 100$, $p = 500$.]

50 Applications of model selection

51 Gaussian Instrumental Variable Model. The Gaussian IV model is
$$y_i = d_i'\alpha + \epsilon_i, \qquad d_i = D(z_i) + v_i \ \ \text{(first stage)},$$
$$\begin{pmatrix} \epsilon_i \\ v_i \end{pmatrix} \sim N\left(0,\ \begin{pmatrix} \sigma^2_\epsilon & \sigma_{\epsilon v} \\ \sigma_{\epsilon v} & \sigma^2_v \end{pmatrix}\right).$$
We can have additional controls $w_i$ in both equations; these are suppressed here (see the references for details). We want to estimate the optimal instrument $D(z_i) = E[d_i \mid z_i]$ using a large list of technical instruments,
$$x_i := (x_{i1},\ldots,x_{ip})' := (P_1(z_i),\ldots,P_p(z_i))',$$
and we can write $D(z_i) = x_i'\beta_0 + r_i$, with $\|\beta_0\|_0 = s$ and $E_n[r_i^2] \lesssim \sigma^2_v\, s/n$, where $p$ is large, possibly much larger than $n$. We then use the estimated optimal instrument to estimate $\alpha$.

52 2SLS with LASSO/Post-LASSO-Estimated Optimal IV.
Step 1: Estimate $\hat D(z_i) = x_i'\hat\beta$ using the LASSO or Post-LASSO estimator.
Step 2: Compute $\hat\alpha = \big\{E_n[\hat D(z_i)\, d_i']\big\}^{-1} E_n[\hat D(z_i)\, y_i]$.
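
A minimal sketch of the two steps for a scalar endogenous regressor, with a post-lasso first stage (our illustration under the model above):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def iv_lasso_2sls(y, d, Z, lam):
    """Step 1: fit the optimal instrument D(z) by post-lasso of d on the
    technical instruments Z. Step 2: use D_hat as the instrument for d."""
    step1 = Lasso(alpha=lam / 2, fit_intercept=False).fit(Z, d)
    T = np.flatnonzero(step1.coef_)
    if T.size == 0:
        raise ValueError("no instruments selected")
    D_hat = LinearRegression(fit_intercept=False).fit(Z[:, T], d).predict(Z[:, T])
    # alpha_hat = {E_n[D_hat(z_i) d_i]}^{-1} E_n[D_hat(z_i) y_i]
    return (D_hat @ y) / (D_hat @ d)
```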

53 2SLS with LASSO-Based IV.
Theorem (2SLS with lasso-based IV). Under the conditions of Result 1, and $s^2\log^2 p = o(n)$, we have
$$(\Lambda^*)^{-1/2}\sqrt{n}(\hat\alpha - \alpha) \to_d N(0, I), \quad \text{where } \Lambda^* := \sigma^2_\epsilon \big\{E_n[\hat D(z_i)\hat D(z_i)']\big\}^{-1}.$$
For more results, see B., Chernozhukov, and Hansen (submitted, ArXiv 2010).
- The requirement $s^2\log^2 p = o(n)$ can be relaxed to $s\log p = o(n)$ if a split-sample IV estimator is used.
- The extension to non-Gaussian and heteroskedastic errors, in B., Chen, Chernozhukov, and Hansen (submitted, ArXiv 2010), required adjustments to the penalty function.

54 Application of IV: Eminent Domain. We estimate the economic consequences of government take-over of property rights from individuals:
- $y_i$ = economic outcome in a region $i$, e.g., a housing price index;
- $d_i$ = number of property take-overs decided in a court of law, by panels of 3 judges;
- $x_i$ = demographic characteristics of the judges, who are randomly assigned to panels: education, political affiliation, age, experience, etc.;
- $f_i$ = $x_i$ plus various interactions of the components of $x_i$; a very large list, $p = p(f_i) = 344$.

55 Application of IV: Eminent Domain, Continued. The outcome is the log of a housing price index; the endogenous variable is the government take-over. One can use 2 elementary instruments, suggested by real lawyers (Chen and Yeh, 2010), or use all 344 instruments and select approximately the right set using LASSO. [Table: price effect and robust standard error for 2SLS with the 2 elementary instruments versus 2SLS with LASSO-selected instruments.]

56 $\ell_1$-Penalized Quantile Regression. Consider a response variable $y$ and $p$-dimensional covariates $x$. The $u$-th conditional quantile function of $y$ given $x$ is $x'\beta(u)$, $\beta(u) \in \mathbb{R}^p$. The dimension $p$ is large, possibly much larger than the sample size $n$. The true coefficient vector $\beta(u)$ is sparse, having only $s < n$ non-zero components; we do not know which. In the population, $\beta(u)$ minimizes
$$Q_u(\beta) = E[\rho_u(y - x'\beta)], \quad \text{where } \rho_u(t) = (u - 1\{t \leq 0\})\,t = u\,t^+ + (1-u)\,t^-.$$

57 $\ell_1$-Penalized Quantile Regression. The canonical estimator, which minimizes
$$\hat Q_u(\beta) = n^{-1}\sum_{i=1}^{n} \rho_u(y_i - x_i'\beta),$$
is not consistent when $p \geq n$. The $\ell_1$-penalized quantile regression estimator $\hat\beta(u)$ solves
$$\min_{\beta \in \mathbb{R}^p}\ \hat Q_u(\beta) + \lambda_u \|\beta\|_1.$$
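
A minimal sketch in Python using scikit-learn's QuantileRegressor, which minimizes $n^{-1}\sum_i \rho_u(y_i - x_i'\beta) + \alpha\|\beta\|_1$ — the same objective with $\alpha$ playing the role of $\lambda_u$. The penalty level below is a rough heuristic of ours, not the theoretically calibrated choice from the literature.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + rng.standard_normal(n)

u = 0.5
lam_u = np.sqrt(u * (1 - u) * np.log(p) / n)   # heuristic level, an assumption
fit = QuantileRegressor(quantile=u, alpha=lam_u, fit_intercept=False,
                        solver="highs").fit(X, y)
print("selected regressors:", np.flatnonzero(fit.coef_))
```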

58 $\ell_1$-QR Rate Result.
Theorem (Rates). Under standard conditions, for $\mathcal{U} \subset (0,1)$ a compact set, with probability approaching 1,
$$\sup_{u \in \mathcal{U}} \|\hat\beta(u) - \beta(u)\|_2\ \lesssim_P\ \sqrt{\frac{s\log(n \vee p)}{n}}.$$
- This is very similar to the uniform rate $\sqrt{s\log n / n}$ attainable when the support is known.
- The rates are similar to those for lasso, but they also hold uniformly over $u \in \mathcal{U}$.
- Although the rates are similar, the analysis is substantially different.

59 Application of $\ell_1$-QR: Conditional Independence. Let $X \in \mathbb{R}^m$ denote a random vector. We are interested in determining the conditional independence structure of $X$: for every pair of components $(i,j)$, determine whether $X_i \perp X_j \mid X_{-i,-j}$. Conditional independence between two components of a Gaussian random vector $X \sim N(0,\Sigma)$ is well understood: for $\Sigma = \mathrm{Var}(X)$, $X_i$ and $X_j$ are independent conditional on the other components if and only if $(\Sigma^{-1})_{ij} = 0$.

60 Application of $\ell_1$-QR: Conditional Independence. We consider a non-Gaussian, high-dimensional setting. Here,
$$X_i \perp X_j \mid X_{-i,-j} \quad \text{if and only if} \quad Q_{X_i}(u \mid X_{-i}) = Q_{X_i}(u \mid X_{-i,-j}) \ \text{for all } u \in (0,1).$$

61 Application of $\ell_1$-QR: Conditional Independence. To approximate conditional independence, consider the following: we say that $X_i$ and $X_j$ are $\mathcal{U}$-quantile uncorrelated if
$$\beta^i(u) \in \arg\min_\beta E[\rho_u(X_i - \beta'X_{-i})] \quad \text{and} \quad \beta^j(u) \in \arg\min_\beta E[\rho_u(X_j - \beta'X_{-j})]$$
are such that $\beta^i_j(u) = 0$ and $\beta^j_i(u) = 0$ for all $u \in \mathcal{U}$.

62 Application of $\ell_1$-QR: Conditional Independence. To approximate conditional independence, consider $\mathcal{U} \subset (0,1)$, a compact set of quantile indices, and a class of functions $\mathcal{F}$. We will say that $X_i \perp_{\mathcal{U},\mathcal{F}} X_j \mid X_{-i,-j}$ if and only if, for every $u \in \mathcal{U}$,
$$f^{i,j}_u(0, X_{-i,-j}) \in \arg\min_{f \in \mathcal{F}} E[\rho_u(X_i - f(X_j, X_{-i,-j}))]$$
and
$$f^{j,i}_u(0, X_{-i,-j}) \in \arg\min_{f \in \mathcal{F}} E[\rho_u(X_j - f(X_i, X_{-i,-j}))].$$
This approach requires: a series approximation for $\mathcal{F}$; uniform estimation across all quantile indices $u \in \mathcal{U}$; and model selection. A rough code sketch of the resulting pairwise procedure follows.
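
Putting the pieces together, here is a rough sketch of the pairwise graph-building procedure with a linear class $\mathcal{F}$ and the $\ell_1$-penalized quantile regression above (illustrative only; the penalty level and the zero-tolerance are assumptions):

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def quantile_graph(X, u_grid=(0.1, 0.25, 0.5, 0.75, 0.9), alpha=0.1, tol=1e-6):
    """Regress each X_i on the remaining components by l1-penalized quantile
    regression over a grid of quantile indices; draw edge (i, j) unless the
    coefficient on X_j is zero in both directions at every u."""
    n, m = X.shape
    strength = np.zeros((m, m))
    for i in range(m):
        others = [k for k in range(m) if k != i]
        for u in u_grid:
            qr = QuantileRegressor(quantile=u, alpha=alpha,
                                   fit_intercept=False, solver="highs")
            b = qr.fit(X[:, others], X[:, i]).coef_
            strength[i, others] = np.maximum(strength[i, others], np.abs(b))
    return np.maximum(strength, strength.T) > tol   # adjacency matrix
```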

63 Application of $\ell_1$-QR: Conditional Independence. Let $X$ be a vector of log returns; draw edge $(i,j)$ if $X_i$ and $X_j$ are not conditionally independent given $X_{-i,-j}$. [Figure: estimated graph over 20 stocks — FUND, KDE, TW, OSIP, HGR, DJCO, RMCF, IBM, CAU, EWST, MSFT, DGSE, SUNW, ORCL, BTFG, AEPI, JJSF, PLXS, LLTC, DELL.]

64 Application of $\ell_1$-QR: Conditional Independence. Consider the IBM returns first. [Figure: edges involving IBM in the estimated graph over the 20 stocks.]

65 Application of $\ell_1$-QR: Conditional Independence. Next consider all returns. [Figure: the full estimated graph over the 20 stocks.]

66 Application of $\ell_1$-QR: Conditional Independence. Next we investigate how this dependence structure is affected by downside movements of the market. This is particularly relevant for hedging, since downside movements of the market are precisely when hedging is needed most. We quantify downside movement by conditioning on the observations with the largest 25%, 50%, 75%, and 100% downside movements of the market (100% corresponds to the overall dependence).

67 Application of $\ell_1$-QR: Conditional Independence. 100% (no downside-movement conditioning). [Figure: estimated graph over the 20 stocks.]

68 Application of $\ell_1$-QR: Conditional Independence. 75%. [Figure: estimated graph.]

69 Application of $\ell_1$-QR: Conditional Independence. 50%. [Figure: estimated graph.]

70 Application of $\ell_1$-QR: Conditional Independence. 25%. [Figure: estimated graph.]

71 Application of $\ell_1$-QR: Conditional Independence. 100% vs. 25%. [Figure: the 100% graph (left) next to the 25% graph (right).] The right graph is much sparser than the left graph. Thus, when markets are moving down, several variables become (nearly) conditionally independent. Therefore, hedging strategies that do not account for downside movements might not offer the desired protection exactly when it is needed most.

72 Variable Length Markov Chains. Consider a dynamic model, such as a dynamic discrete choice model or discrete stochastic dynamic programming. The main feature we are interested in: the history of outcomes potentially impacts the probability distribution of the next outcome.

73 Variable Length Markov Chains. Suppose at each time $t$ agents choose $s_t \in A = \{\text{Low}, \text{High}\}$. The typical assumption is a Markov chain of order 1, which requires estimating 2 conditional probabilities. [Figure: depth-1 history tree with nodes L, H.]

74 Variable Length Markov Chains. A Markov chain of order 2 requires estimating 4 conditional probabilities. [Figure: depth-2 history tree with leaves LL, LH, HL, HH.]

75 Variable Length Markov Chains. A Markov chain of order 3 requires estimating 8 conditional probabilities. [Figure: depth-3 history tree with leaves LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH.]

76 Variable Length Markov Chains. A Markov chain of order $K$ requires estimating $2^K$ conditional probabilities. [Figure: depth-$K$ history tree.]

77 Variable Length Markov Chains. Issues:
- $K$ is unknown (and potentially infinite);
- the full $K$-th order Markov chain becomes large very quickly;
- similar histories may lead to similar transition probabilities.
Motivation:
- bias (small model) vs. variance (large model);
- make the order of the Markov chain depend on the history;
- model selection balances misspecification against the length of confidence intervals.

78 Variable Length Markov Chains. The variable length Markov chain is very flexible, but one needs to select the context tree. [Figure: a context tree with leaves LL, LHL, LHH, HL, HH — the history LH is refined one level deeper than the others.]

79 Variable Length Markov Chains. A penalty method based on (uniform) confidence intervals: uniform estimation of the transition probabilities over all possible histories $x$,
$$\sup_{x}\ \big\|\hat P(\cdot \mid x) - P(\cdot \mid x)\big\|,$$
with bounds that achieve the minimax rates for some classes. The results allow us to establish: in dynamic discrete choice models, uniform estimation of dynamic marginal effects; in discrete stochastic dynamic programming, uniform estimation of the value function. A code sketch of the context-selection idea follows.
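
Here is a minimal sketch of that idea for a binary chain: keep a longer context only when its estimated transition probability escapes a confidence band around its parent context. The band $c\sqrt{\log n/\text{count}}$ is an assumed Hoeffding-style rule standing in for the paper's exact penalty.

```python
import numpy as np
from collections import Counter

def context_tree(seq, K=5, c=1.0):
    """Context selection sketch for a binary chain s_t in {0, 1}: estimate
    P(s_t = 1 | last k outcomes) for k <= K, then keep a longer context only
    when its estimate escapes the confidence band around its parent context."""
    n = len(seq)
    counts, ones = Counter(), Counter()
    for t in range(K, n):
        for k in range(1, K + 1):
            ctx = tuple(seq[t - k:t])          # the k most recent outcomes
            counts[ctx] += 1
            ones[ctx] += seq[t]
    kept = {}
    for ctx in sorted(counts, key=len):        # parents before children
        p_hat = ones[ctx] / counts[ctx]
        if len(ctx) == 1:
            kept[ctx] = p_hat                  # always keep order-1 contexts
            continue
        parent = ctx[1:]                       # same history, one symbol shorter
        band = c * np.sqrt(np.log(n) / counts[ctx])
        if parent in kept and abs(p_hat - kept[parent]) > band:
            kept[ctx] = p_hat                  # refining this history is justified
    return kept
```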

80 Conclusion

81 Conclusion
1. $\ell_1$- and post-$\ell_1$-penalization methods offer a good way to control for confounding factors, offer a good way to select series terms, and are especially useful when $p \gg n$.
2. $\ell_1$-penalized estimators achieve near-oracle rates.
3. Post-$\ell_1$-penalized estimators achieve at least near-oracle rates.
4. There are many applications of model selection in econometrics.

85 References
Bickel, Ritov, and Tsybakov, "Simultaneous analysis of Lasso and Dantzig selector," Annals of Statistics, 2009.
Candès and Tao, "The Dantzig selector: statistical estimation when p is much larger than n," Annals of Statistics, 2007.
Meinshausen and Yu, "Lasso-type recovery of sparse representations for high-dimensional data," Annals of Statistics, 2009.
Tibshirani, "Regression shrinkage and selection via the Lasso," J. Roy. Statist. Soc. Ser. B, 1996.

86 References
Belloni and Chernozhukov, "$\ell_1$-Penalized Quantile Regression for High-Dimensional Sparse Models," Annals of Statistics, 2011.
Belloni and Chernozhukov, "Post-$\ell_1$-Penalized Estimators in High-Dimensional Linear Regression Models," ArXiv.
Belloni, Chernozhukov, and Wang, "Square-Root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming," Biometrika, 2011.
Belloni and Chernozhukov, "High Dimensional Sparse Econometric Models: An Introduction," in Ill-Posed Inverse Problems and High-Dimensional Estimation, Springer Lecture Notes in Statistics.
Belloni, Chernozhukov, and Hansen, "Estimation and Inference Methods for High-Dimensional Sparse Econometric Models: A Review," submitted, 2011.
Belloni and Oliveira, "Approximate Group Context Tree: Application to Dynamic Programming and Dynamic Choice Models," submitted, 2011.


More information

Lecture 9: Quantile Methods 2

Lecture 9: Quantile Methods 2 Lecture 9: Quantile Methods 2 1. Equivariance. 2. GMM for Quantiles. 3. Endogenous Models 4. Empirical Examples 1 1. Equivariance to Monotone Transformations. Theorem (Equivariance of Quantiles under Monotone

More information

Motivation Non-linear Rational Expectations The Permanent Income Hypothesis The Log of Gravity Non-linear IV Estimation Summary.

Motivation Non-linear Rational Expectations The Permanent Income Hypothesis The Log of Gravity Non-linear IV Estimation Summary. Econometrics I Department of Economics Universidad Carlos III de Madrid Master in Industrial Economics and Markets Outline Motivation 1 Motivation 2 3 4 5 Motivation Hansen's contributions GMM was developed

More information

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility The Slow Convergence of OLS Estimators of α, β and Portfolio Weights under Long Memory Stochastic Volatility New York University Stern School of Business June 21, 2018 Introduction Bivariate long memory

More information

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008 Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)

More information

PhD/MA Econometrics Examination January 2012 PART A

PhD/MA Econometrics Examination January 2012 PART A PhD/MA Econometrics Examination January 2012 PART A ANSWER ANY TWO QUESTIONS IN THIS SECTION NOTE: (1) The indicator function has the properties: (2) Question 1 Let, [defined as if using the indicator

More information

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University The Bootstrap: Theory and Applications Biing-Shen Kuo National Chengchi University Motivation: Poor Asymptotic Approximation Most of statistical inference relies on asymptotic theory. Motivation: Poor

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Zhengyu Zhang School of Economics Shanghai University of Finance and Economics zy.zhang@mail.shufe.edu.cn

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Topic 4: Model Specifications

Topic 4: Model Specifications Topic 4: Model Specifications Advanced Econometrics (I) Dong Chen School of Economics, Peking University 1 Functional Forms 1.1 Redefining Variables Change the unit of measurement of the variables will

More information

Review of Econometrics

Review of Econometrics Review of Econometrics Zheng Tian June 5th, 2017 1 The Essence of the OLS Estimation Multiple regression model involves the models as follows Y i = β 0 + β 1 X 1i + β 2 X 2i + + β k X ki + u i, i = 1,...,

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information