Business 41903                                        Instructor: Christian Hansen

Problem Set 7

1. Use the data in MROZ.raw to answer this question. The data consist of 753 observations. Before answering any of parts a.-b., remove 253 observations which will be used for an out-of-sample comparison in part c. Ideally, these would be the same observations left out when you answered problem 1 on Problem Set 5.

a. Estimate the model E[inlf | X = (kidslt6, kidsge6, age, educ, repwage, faminc, exper)] = p^K(X)'β by lasso with penalty parameter chosen by cross-validation. Carefully explain how you construct the dictionary of approximating functions p^K(X).

b. Estimate the model E[inlf | X = (kidslt6, kidsge6, age, educ, repwage, faminc, exper)] = Λ(p^K(X)'β), where Λ(·) is the logistic cdf, by ℓ1-penalized logistic regression with penalty parameter chosen by cross-validation.

c. Use the 253 observations you held out to compare the estimates obtained in parts a.-b. and the estimates obtained in problem 1 on Problem Set 5. Calculate the mean square forecast error as (1/253) Σ_{i ∈ hold out} (ĝ_j(x_i) - y_i)² and the misclassification rate as (1/253) Σ_{i ∈ hold out} 1(ŷ_{j,i} ≠ y_i), where ŷ_{j,i} is the Bayes classifier based on model j, i.e. ŷ_{j,i} = 1(ĝ_j(x_i) ≥ .5), and ĝ_j(x_i) are the fitted values obtained from each of the competing models. Which procedure performs best according to each metric? Do the performance discrepancies seem large? [Note: Assuming independent sampling, you can compute a standard error for the mean square forecast error and for the misclassification rate conditioning on the estimated model.]

2. Use the data in CreditCardDefault.xls to answer this question. The data consist of 30000 observations. Before answering any of parts a.-b., remove 10000 observations which will be used for an out-of-sample comparison in part c. Ideally, these would be the same observations left out when you answered problem 2 on Problem Set 5.
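The hold-out metrics requested in part c. can be computed directly from fitted values. A minimal numpy sketch (the function name and the toy inputs are illustrative, not part of the assignment; `g_hat` stands for the fitted values ĝ_j(x_i) from whichever candidate model is being evaluated):

```python
import numpy as np

def forecast_metrics(g_hat, y):
    """Mean square forecast error and misclassification rate on a hold-out
    sample, with standard errors treating the fitted model as fixed."""
    g_hat = np.asarray(g_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    mse = np.mean((g_hat - y) ** 2)            # (1/n_hold) sum (ghat_j(x_i) - y_i)^2
    y_class = (g_hat >= 0.5).astype(float)     # Bayes classifier 1(ghat_j(x_i) >= .5)
    mis = np.mean(y_class != y)                # (1/n_hold) sum 1(yhat_{j,i} != y_i)
    # Conditional-on-model standard errors under independent sampling
    se_mse = np.std((g_hat - y) ** 2, ddof=1) / np.sqrt(n)
    se_mis = np.sqrt(mis * (1 - mis) / n)      # binomial standard error
    return mse, mis, se_mse, se_mis

# Toy example with four hold-out observations
mse, mis, se_mse, se_mis = forecast_metrics([0.8, 0.2, 0.6, 0.4], [1, 0, 0, 1])
# mse = 0.2, mis = 0.5
```

In practice `g_hat` would be the hold-out predictions from the lasso, penalized logistic, and Problem Set 5 models in turn, so that the same two metrics are comparable across all candidates.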
a. Estimate the model E[default | x_1, ..., x_23] = p^K(x_1, ..., x_23)'β, for x_1, ..., x_23 defined in CreditCardDefault.des, by lasso with penalty parameter chosen by cross-validation. Carefully explain how you construct the dictionary of approximating functions p^K(x_1, ..., x_23).

b. Estimate the model E[default | x_1, ..., x_23] = Λ(p^K(x_1, ..., x_23)'β), for x_1, ..., x_23 defined in CreditCardDefault.des and Λ(·) the logistic cdf, by ℓ1-penalized logistic regression with penalty parameter chosen by cross-validation.

c. Use the 10000 observations you held out to compare the estimates obtained in parts a.-b. and the estimates obtained in problem 2 on Problem Set 5. Calculate the mean square forecast error as (1/10000) Σ_{i ∈ hold out} (ĝ_j(x_i) - y_i)² and the misclassification rate as (1/10000) Σ_{i ∈ hold out} 1(ŷ_{j,i} ≠ y_i), where ŷ_{j,i} is the Bayes classifier based on model j, i.e. ŷ_{j,i} = 1(ĝ_j(x_i) ≥ .5), and ĝ_j(x_i) are the fitted values obtained from each of the competing models. Which procedure performs best according to each metric? Do the performance discrepancies seem large? [Note: Assuming independent sampling, you can compute a standard error for the mean square forecast error and for the misclassification rate conditioning on the estimated model.]

3. [Post-selection inference example] Consider a linear regression model

    Y_i = β_1 X_{1,i} + β_{2,n} X_{2,i} + ε_i
    X_{1,i} = π_X X_{2,i} + v_i

where (ε_i, v_i) ~ N(0, diag(σ², κ²)) are iid across i and independent of the n × 2 design matrix X, and the second equation simply parameterizes the covariance between X_1 and X_2. Suppose that the parameter of interest is β_1 and that you are unsure of whether X_2 should be included in the model (i.e. you are unsure about whether β_{2,n} = 0). Let β̂_1 and β̂_2 be the conventional OLS estimators of β_1 and β_{2,n} obtained by regressing Y on X_1 and X_2, and let s_{β̂_1} and s_{β̂_2} denote the corresponding standard error estimators. Let β̌_1 be the conventional OLS estimator of β_1 obtained by regressing Y on X_1 (i.e.
excluding X_2), and let s_{β̌_1} be the corresponding standard error estimator.

a. Under the assumption that β_{2,n} = 0, show that β̌_1 is consistent, asymptotically normal, and has asymptotic variance less than or equal to that of β̂_1.
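The variance ranking in part a. can be sanity-checked numerically. The sketch below uses an assumed design (n = 200, π_X = 0.7, σ² = 1; none of these values come from the problem) and compares the conditional variances of the long- and short-regression estimators of β_1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
pi_X = 0.7                      # assumed correlation parameter, as in X_1 = pi_X * X_2 + v
x2 = rng.normal(size=n)
x1 = pi_X * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

sigma2 = 1.0                    # error variance; with beta_{2,n} = 0 both estimators are unbiased
var_long = sigma2 * np.linalg.inv(X.T @ X)[0, 0]   # Var(beta1_hat | X), regression on X_1 and X_2
var_short = sigma2 / (x1 @ x1)                     # Var(beta1_check | X), regression on X_1 only

# By the partitioned-inverse formula,
# [(X'X)^{-1}]_{11} = 1 / (x1'x1 - x1'x2 (x2'x2)^{-1} x2'x1) >= 1 / (x1'x1),
# so var_long >= var_short for any design, with equality only when x1'x2 = 0.
```

This is only the conditional-variance half of part a.; consistency and asymptotic normality of β̌_1 still need the analytical argument.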
b. Let t_{β̂_2} = β̂_2 / s_{β̂_2}. Show that Pr(|t_{β̂_2}| > c_n) → 1 for c_n = √(log(n)) t_{n-2}^{-1}(.975) ≍ √(2 log(n)) when β_{2,n} = δ with δ > 0, where t_{n-2} denotes the cdf of a t_{n-2} random variable. Similarly, show that Pr(|t_{β̂_2}| ≤ c_n) → 1 when β_{2,n} = 0.

c. Consider the estimator β̃_1 = 1(|t_{β̂_2}| > c_n) β̂_1 + 1(|t_{β̂_2}| ≤ c_n) β̌_1. Derive the asymptotic properties of β̃_1 when (i) β_{2,n} = δ with δ > 0 and when (ii) β_{2,n} = 0. Show that

    t_{β̃_1} = (β̃_1 - β_1^0) / (1(|t_{β̂_2}| > c_n) s_{β̂_1} + 1(|t_{β̂_2}| ≤ c_n) s_{β̌_1}) →_d N(0, 1)

when the null hypothesis H_0: β_1 = β_1^0 is true, both when (i) β_{2,n} = δ with δ > 0 and when (ii) β_{2,n} = 0. Conclude that β̃_1 is as efficient as the oracle estimator that knows whether β_{2,n} = 0, despite having to learn β_{2,n} from the data.

d. Now consider a sequence of models where β_{2,n} = b a_n/√n for some b with |b| > 0 and a_n > 0 with a_n → ∞ and a_n/√(log(n)) → 0. Show that β̃_1 is consistent. Show that √n(β̃_1 - β_1^0) diverges, where β_1^0 is the true value of β_1, when E[X_1 X_2] ≠ 0 (when π_X ≠ 0). Explain in words what this sequence of models captures and why this is an appropriate thought experiment for understanding the finite-sample properties of the estimator β̃_1. What do the results thus far suggest about the desirability of using β̃_1 in finite samples?

e. Note that the moment condition underlying the definition of β̃_1 is E[(Y - X_1 β_1 - X_2 β_2)X_1] = 0. Show that this moment condition is satisfied at the true values of β_1 and β_2. Show that this moment condition does not have the orthogonality property discussed in Section 7 of the notes, in that the derivative with respect to β_2 evaluated at the true parameter values is not 0.

f. Let π̂_Y denote the least squares coefficient obtained from regressing Y on X_2, with associated standard error estimator s_{π̂_Y} and t-statistic for testing π_Y = 0 of t_Y = π̂_Y / s_{π̂_Y}. Similarly, let π̂_X denote the least squares coefficient obtained from regressing X_1 on X_2, with associated standard error estimator s_{π̂_X} and t-statistic for testing π_X = 0 of t_X = π̂_X / s_{π̂_X}.
Define a new estimator β̄_1 = 1((|t_Y| > c_n) or (|t_X| > c_n)) β̂_1 + 1((|t_Y| ≤ c_n) and (|t_X| ≤ c_n)) β̌_1. Show that √n(β̄_1 - β_1^0) →_d N(0, V̄) when β_{2,n} = 0, β_{2,n} = δ, or β_{2,n} = a_n b/√n, regardless of the value of E[X_1 X_2] (and the sequence a_n). Suggest a consistent estimator of V̄.

g. Note that the moment condition underlying the definition of β̄_1 is E[((Y - E[Y | X_2]) - (X_1 - E[X_1 | X_2])β_1)(X_1 - E[X_1 | X_2])] = 0, where E[Y | X_2] = π_Y X_2 and E[X_1 | X_2] = π_X X_2. Show
that this moment condition is satisfied at the true values of β_1, π_Y, and π_X. Show that this moment condition has the orthogonality property discussed in Section 7 of the notes, in that the derivative with respect to the nuisance parameters π_Y and π_X evaluated at the true parameter values is 0.

h. Design a simulation experiment to illustrate the potential consequences of the lack of uniformity of the estimator β̃_1 on inference. Specifically, you should be able to design a simulation experiment where the distribution of β̃_1 is strongly bimodal and the size of tests based on t_{β̃_1} is far from the nominal level. Show the robustness of β̄_1 within this design, in that the normal approximation provides a sensible approximation to the distribution of β̄_1 across simulation replications, and tests based on the t-statistic formed using β̄_1 and the suggested estimator of V̄ from part (f) have approximately correct size. [A test is not uniform with respect to a class of models if there are sequences of models within this class that lead to the test being size distorted even in large samples. Uniformity of inference with respect to sensible sequences of models is very important in practice, as well-designed sequences are much better able to capture the actual finite-sample performance of estimators.]

4. Recall that a doubly robust estimator of an average treatment effect is given by

    ÂTE_robust = (1/n) Σ_{i=1}^n [ D_i(Y_i - ĝ_1(X_i))/ê(X_i) - (1 - D_i)(Y_i - ĝ_0(X_i))/(1 - ê(X_i)) + ĝ_1(X_i) - ĝ_0(X_i) ] = (1/n) Σ_{i=1}^n ψ̂_i.

Belloni, Chernozhukov, and Hansen (2014) show that √n(ÂTE_robust - ATE) →_d N(0, V), where

    V̂ = (1/n) Σ_{i=1}^n (ψ̂_i - ÂTE_robust)² →_p V,

when ℓ1-penalized estimation is used to form ĝ_1(·), ĝ_0(·), and ê(·), under regularity conditions including having iid data and the assumption that these functions are approximately sparse.
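Given fitted values for g_1, g_0, and e, however they are estimated, the point estimate and the variance estimator above reduce to a few lines. A minimal numpy sketch (the function name and the toy fitted values are placeholders, not the course's code):

```python
import numpy as np

def ate_robust(y, d, g1_hat, g0_hat, e_hat):
    """Doubly robust ATE and its standard error from fitted values
    g1_hat(X_i), g0_hat(X_i), e_hat(X_i) and data (y_i, d_i)."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    g1_hat, g0_hat, e_hat = map(np.asarray, (g1_hat, g0_hat, e_hat))
    # psi_i = D_i(Y_i - g1(X_i))/e(X_i) - (1-D_i)(Y_i - g0(X_i))/(1-e(X_i)) + g1(X_i) - g0(X_i)
    psi = (d * (y - g1_hat) / e_hat
           - (1 - d) * (y - g0_hat) / (1 - e_hat)
           + g1_hat - g0_hat)
    ate = psi.mean()
    v_hat = np.mean((psi - ate) ** 2)      # V_hat = (1/n) sum (psi_i - ATE_hat)^2
    se = np.sqrt(v_hat / len(y))           # standard error of ATE_hat
    return ate, se

# Toy example: constant fitted values 0.5 everywhere
ate, se = ate_robust([1, 0, 1, 0], [1, 1, 0, 0],
                     np.full(4, 0.5), np.full(4, 0.5), np.full(4, 0.5))
```

The same function can then be fed the fitted values produced in parts d. and g. below, so the two sets of ATE estimates and standard errors are computed identically.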
By approximately sparse, we mean that g_1(X_i) = X_i'β_1 + r_1, g_0(X_i) = X_i'β_0 + r_0, and e(X_i) = Λ(X_i'γ) + r_e with max{‖β_1‖_0, ‖β_0‖_0, ‖γ‖_0} ≤ s, where r_1, r_0, and r_e are approximation errors that satisfy max{E[r_1²], E[r_0²], E[r_e²]} = O(s/n) and s² log(p)³/n → 0.

Consider the data in restatw.dat, which contains the data used in the 401(k) example in the first two lectures. Use e401 (eligibility for a 401(k) plan) as the treatment variable and net_tfa (net total financial assets) as the dependent variable. The argument for exogeneity of a 401(k) plan
relies on conditioning on characteristics that might be associated with a person's decision to take a job and saving preferences. Potential control variables are age, inc (income), fsize (family size), educ (years of education), marr (marital status), male, twoearn (part of a two-earner household), db (has a defined benefit pension), and pira (has an IRA).

a. Construct an approximating dictionary to use in ℓ1-penalized estimation using the control variables above. Carefully explain your choices in choosing what functions to put in the dictionary. Explain intuitively what the sparsity assumption requires in terms of this example and within the context of the dictionary you have chosen. Does it seem plausible that this assumption would be satisfied?

b. Estimate g_1(·) and g_0(·) using lasso with penalty weights that are appropriate under heteroscedasticity and penalty parameter λ = 2.2 √n Φ⁻¹(1 - (.1/log(n))/(2p)), where p is the number of elements in your dictionary and n is the appropriate sample size. (Note that n will not be the same for estimating g_0 and g_1. Note also that the λ given is appropriate for solving

    β̂ ∈ arg min_b (1/n) Σ_{i=1}^n (y_i - x_i'b)² + (λ/n) Σ_{j=1}^p φ̂_j |b_j|.

Some lasso implementations use different scalings, such as

    β̂ ∈ arg min_b (1/(2n)) Σ_{i=1}^n (y_i - x_i'b)² + λ Σ_{j=1}^p φ̂_j |b_j|,

which would require alteration of the penalty parameter so that you are solving the same problem.) Which variables are selected to approximate each function? Do these variables make sense? Explain. Should we conclude that the selected variables are the true variables in the sense that we have captured the correct models for E[Y_1 | X] and E[Y_0 | X]? Explain.

c. Estimate e(·) using ℓ1-penalized logistic regression with λ = 1.1 √n Φ⁻¹(1 - (.1/log(n))/(2p)), where p is the number of elements in your dictionary and n is the appropriate sample size. (Be careful about scaling again. This λ is appropriate for solving

    γ̂ ∈ arg min_g -(1/n) Σ_{i=1}^n log-likelihood(y_i, x_i, g) + (λ/n) Σ_{j=1}^p φ̂_j |g_j|.
If the ℓ1-penalized logistic regression function you are using uses a different scaling, you will need to adjust λ appropriately.) Which variables are selected? Do these variables make sense? Explain. Should we conclude that the selected variables are the true variables in the sense that we have captured the correct model for E[D | X]? Explain.

d. Take the selected variables for estimating g_1 and estimate ĝ_1(X) by unpenalized least squares regression of Y on these selected variables in the subsample of observations with D = 1. Take the selected variables for estimating g_0 and estimate ĝ_0(X) by unpenalized least squares regression of Y on these selected variables in the subsample of observations with D = 0. Take the selected variables for estimating e(X) and estimate ê(X) analogously by unpenalized logistic regression of D on these selected variables. Form fitted values for g_1, g_0, and e for each observation in the data. Using these fitted values, obtain ÂTE_robust and estimate its standard error.

e. Estimate g_1(·) and g_0(·) using lasso with penalty weights that are appropriate under homoscedasticity and with penalty parameter chosen by cross-validation. Which variables are selected to approximate each function? Do these variables make sense? Explain. Do these results differ appreciably from those in part b.? Should we conclude that the selected variables are the true variables in the sense that we have captured the correct models for E[Y_1 | X] and E[Y_0 | X]? Explain.

f. Estimate e(·) using ℓ1-penalized logistic regression with λ chosen by cross-validation. Which variables are selected? Do these variables make sense? Explain. Do these results differ appreciably from those in part c.? Should we conclude that the selected variables are the true variables in the sense that we have captured the correct model for E[D | X]? Explain.

g. Take the estimated models from e. and f. for g_0, g_1, and e (i.e. just use the coefficient estimates that come directly out of the estimation) to form fitted values for g_1, g_0, and e for each observation in the data.
Using these fitted values, obtain ÂTE_robust and estimate its standard error. Are these results appreciably different from those obtained in part d.? Explain the significance of the similarity or difference.
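The plug-in penalty levels used in parts b. and c. of problem 4 need only the standard library. A sketch, assuming the scalings of the objectives stated above (Φ⁻¹ is computed with `statistics.NormalDist`; the function name is illustrative):

```python
import math
from statistics import NormalDist

def plugin_lambda(n, p, c=2.2):
    """Plug-in penalty lambda = c * sqrt(n) * Phi^{-1}(1 - (.1/log(n))/(2p)),
    with c = 2.2 for the least squares lasso (part b) and c = 1.1 for the
    l1-penalized logistic regression (part c)."""
    gamma = 0.1 / math.log(n)                     # slack level .1/log(n)
    quantile = 1 - gamma / (2 * p)                # upper tail probability split over 2p terms
    return c * math.sqrt(n) * NormalDist().inv_cdf(quantile)

# Illustrative call with hypothetical n and p (use your own dictionary size
# and the relevant subsample size for each of g_0, g_1, and e):
lam_lasso = plugin_lambda(1000, 100)          # c = 2.2 default
lam_logit = plugin_lambda(1000, 100, c=1.1)   # logistic case
```

Because λ is linear in c, the logistic penalty is exactly half the least squares penalty for the same n and p; the value of λ grows with the dictionary size p, which is the intended behavior of the plug-in rule.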