Business 4903
Instructor: Christian Hansen

Problem Set 7

1. Use the data in MROZ.raw to answer this question. The data consist of 753 observations. Before answering any of parts a.-b., remove 253 observations which will be used for an out-of-sample comparison in part c. Ideally, these would be the same observations left out when you answered problem 1 on Problem Set 5.

a. Estimate the model $E[\text{inlf} \mid X = \{\text{kidslt6}, \text{kidsge6}, \text{age}, \text{educ}, \text{repwage}, \text{faminc}, \text{exper}\}] = p^K(X)'\beta$ by lasso with penalty parameter chosen by cross-validation. Carefully explain how you construct the dictionary of approximating functions $p^K(X)$.

b. Estimate the model $E[\text{inlf} \mid X = \{\text{kidslt6}, \text{kidsge6}, \text{age}, \text{educ}, \text{repwage}, \text{faminc}, \text{exper}\}] = \Lambda(p^K(X)'\beta)$, where $\Lambda(\cdot)$ is the logistic cdf, by $\ell_1$-penalized logistic regression with penalty parameter chosen by cross-validation.

c. Use the 253 observations you held out to compare the estimates obtained in parts a.-b. and the estimates obtained in problem 1 on Problem Set 5. Calculate the mean square forecast error as $\frac{1}{253}\sum_{i \in \text{hold-out}} (\hat{g}_j(x_i) - y_i)^2$ and the misclassification rate as $\frac{1}{253}\sum_{i \in \text{hold-out}} 1(\hat{y}_{j,i} \neq y_i)$, where $\hat{y}_{j,i}$ is the Bayes classifier based on model $j$, i.e. $\hat{y}_{j,i} = 1(\hat{g}_j(x_i) \geq .5)$, and $\hat{g}_j(x_i)$ are the fitted values obtained from each of the competing models. Which procedure performs best according to each metric? Do the performance discrepancies seem large? [Note: Assuming independent sampling, you can compute a standard error for the mean square forecast error and for the misclassification rate conditioning on the estimated model.] (A code sketch for parts a.-c. follows this problem; the standard errors are sketched after problem 2.)
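For concreteness, parts a.-c. might be organized as in the following Python sketch. The dictionary choice (levels, squares, and pairwise interactions), the 10-fold cross-validation, the random hold-out split, and the assumption that MROZ.raw loads as a whitespace-delimited file with named columns are all illustrative assumptions, not requirements of the problem.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical loading step; adjust column names/format to the actual file.
df = pd.read_csv("MROZ.raw", sep=r"\s+")
controls = ["kidslt6", "kidsge6", "age", "educ", "repwage", "faminc", "exper"]
y = df["inlf"].to_numpy()

# One possible dictionary p^K(X): levels, squares, and pairwise interactions.
pK = PolynomialFeatures(degree=2, include_bias=False).fit_transform(df[controls])
X = StandardScaler().fit_transform(pK)

# Hold out 253 observations for part c.
rng = np.random.default_rng(0)
hold = rng.choice(len(df), size=253, replace=False)
train = np.setdiff1d(np.arange(len(df)), hold)

# a. Linear model for E[inlf | X] by lasso, penalty chosen by cross-validation.
lasso = LassoCV(cv=10).fit(X[train], y[train])

# b. l1-penalized logistic regression, penalty chosen by cross-validation.
logit = LogisticRegressionCV(cv=10, penalty="l1", solver="liblinear")
logit.fit(X[train], y[train])

# c. Mean square forecast error and misclassification rate on the hold-out.
for name, g in [("lasso", lasso.predict(X[hold])),
                ("logit", logit.predict_proba(X[hold])[:, 1])]:
    msfe = np.mean((g - y[hold]) ** 2)
    miscls = np.mean((g >= 0.5).astype(float) != y[hold])
    print(f"{name}: MSFE = {msfe:.4f}, misclassification = {miscls:.4f}")
```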

2. Use the data in CreditCardDefault.xls to answer this question. The data consist of 30000 observations. Before answering any of parts a.-b., remove 10000 observations which will be used for an out-of-sample comparison in part c. Ideally, these would be the same observations left out when you answered problem 2 on Problem Set 5.

a. Estimate the model $E[\text{default} \mid x_1, \ldots, x_{23}] = p^K(x_1, \ldots, x_{23})'\beta$ for $x_1, \ldots, x_{23}$ defined in CreditCardDefault.des by lasso with penalty parameter chosen by cross-validation. Carefully explain how you construct the dictionary of approximating functions $p^K(x_1, \ldots, x_{23})$.

b. Estimate the model $E[\text{default} \mid x_1, \ldots, x_{23}] = \Lambda(p^K(x_1, \ldots, x_{23})'\beta)$ for $x_1, \ldots, x_{23}$ defined in CreditCardDefault.des and $\Lambda(\cdot)$ the logistic cdf by $\ell_1$-penalized logistic regression with penalty parameter chosen by cross-validation.

c. Use the 10000 observations you held out to compare the estimates obtained in parts a.-b. and the estimates obtained in problem 2 on Problem Set 5. Calculate the mean square forecast error as $\frac{1}{10000}\sum_{i \in \text{hold-out}} (\hat{g}_j(x_i) - y_i)^2$ and the misclassification rate as $\frac{1}{10000}\sum_{i \in \text{hold-out}} 1(\hat{y}_{j,i} \neq y_i)$, where $\hat{y}_{j,i}$ is the Bayes classifier based on model $j$, i.e. $\hat{y}_{j,i} = 1(\hat{g}_j(x_i) \geq .5)$, and $\hat{g}_j(x_i)$ are the fitted values obtained from each of the competing models. Which procedure performs best according to each metric? Do the performance discrepancies seem large? [Note: Assuming independent sampling, you can compute a standard error for the mean square forecast error and for the misclassification rate conditioning on the estimated model.]
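The standard errors requested in the bracketed notes of problems 1 and 2 can be computed as below, treating the fitted model as fixed and the hold-out observations as independent draws. This is one assumed implementation, not part of the problem statement.

```python
import numpy as np

def holdout_metrics_with_se(g_hat, y):
    """MSFE and misclassification rate on a hold-out sample, with standard
    errors computed conditional on the estimated model."""
    m = len(y)
    sq_err = (g_hat - y) ** 2
    msfe, msfe_se = sq_err.mean(), sq_err.std(ddof=1) / np.sqrt(m)
    wrong = ((g_hat >= 0.5).astype(float) != y).astype(float)
    rate = wrong.mean()
    rate_se = np.sqrt(rate * (1.0 - rate) / m)  # binomial standard error
    return msfe, msfe_se, rate, rate_se
```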

3. [Post-selection inference example] Consider a linear regression model

$Y_i = \beta_1 X_{1,i} + \beta_{2,n} X_{2,i} + \varepsilon_i$
$X_{1,i} = \pi_X X_{2,i} + v_i$

where $(\varepsilon_i, v_i)' \sim N(0, \mathrm{diag}(\sigma^2, \kappa^2))$ are iid across $i$ and independent of the $n \times 2$ design matrix $X$, and the second equation simply parameterizes the covariance between $X_1$ and $X_2$. Suppose that the parameter of interest is $\beta_1$ and that you are unsure of whether $X_2$ should be included in the model (i.e. you are unsure about whether $\beta_{2,n} = 0$). Let $\hat{\beta}_1$ and $\hat{\beta}_2$ be the conventional OLS estimators of $\beta_1$ and $\beta_{2,n}$ obtained by regressing $Y$ on $X_1$ and $X_2$, and let $s_{\hat{\beta}_1}$ and $s_{\hat{\beta}_2}$ denote the corresponding standard error estimators. Let $\check{\beta}_1$ be the conventional OLS estimator of $\beta_1$ obtained by regressing $Y$ on $X_1$ (i.e. excluding $X_2$), and let $s_{\check{\beta}_1}$ be the corresponding standard error estimator.

a. Under the assumption that $\beta_{2,n} = 0$, show that $\check{\beta}_1$ is consistent, asymptotically normal, and has variance less than or equal to that of $\hat{\beta}_1$.

b. Let $t_{\hat{\beta}_2} = \hat{\beta}_2 / s_{\hat{\beta}_2}$. Show that $\Pr(|t_{\hat{\beta}_2}| > c_n) \to 1$ for $c_n = \sqrt{\log(n)}\, t_{n-2}^{-1}(.975) \geq \sqrt{2\log(n)}$ when $\beta_{2,n} = \delta$ with $|\delta| > 0$, where $t_{n-2}$ denotes the cdf of a $t_{n-2}$ random variable. Similarly show that $\Pr(|t_{\hat{\beta}_2}| \leq c_n) \to 1$ when $\beta_{2,n} = 0$.

c. Consider the estimator $\tilde{\beta}_1 = 1(|t_{\hat{\beta}_2}| > c_n)\hat{\beta}_1 + 1(|t_{\hat{\beta}_2}| \leq c_n)\check{\beta}_1$. Derive the asymptotic properties of $\tilde{\beta}_1$ when (i) $\beta_{2,n} = \delta$ with $|\delta| > 0$ and when (ii) $\beta_{2,n} = 0$. Show that, when the null hypothesis $H_0$ that $\beta_1$ equals its true value is true,

$t_{\tilde{\beta}_1} = \frac{\tilde{\beta}_1 - \beta_1}{1(|t_{\hat{\beta}_2}| > c_n)s_{\hat{\beta}_1} + 1(|t_{\hat{\beta}_2}| \leq c_n)s_{\check{\beta}_1}} \to_d N(0, 1)$

when (i) $\beta_{2,n} = \delta$ with $|\delta| > 0$ and when (ii) $\beta_{2,n} = 0$. Conclude that $\tilde{\beta}_1$ is as efficient as the oracle estimator that knows whether $\beta_{2,n} = 0$ despite having to learn $\beta_{2,n}$ from the data.

d. Now consider a sequence of models where $\beta_{2,n} = b a_n/\sqrt{n}$ for some $b$ with $|b| > 0$ and $a_n > 0$ with $a_n \to \infty$ and $a_n/\sqrt{\log(n)} \to 0$. Show that $\tilde{\beta}_1$ is consistent. Show that $|\sqrt{n}(\tilde{\beta}_1 - \beta_1)| \to_p \infty$, where $\beta_1$ is the true value of $\beta_1$, when $E[X_1 X_2] \neq 0$ (when $\pi_X \neq 0$). Explain in words what this sequence of models captures and why this is an appropriate thought experiment for understanding the finite-sample properties of the estimator $\tilde{\beta}_1$. What do the results thus far suggest about the desirability of using $\tilde{\beta}_1$ in finite samples?

e. Note that the moment condition underlying the definition of $\tilde{\beta}_1$ is $E[(Y - X_1\beta_1 - X_2\beta_2)X_1] = 0$. Show that this moment condition is satisfied at the true values of $\beta_1$ and $\beta_2$. Show that this moment condition does not have the orthogonality property discussed in Section 7 of the notes in that the derivative with respect to $\beta_2$ evaluated at the true parameter values is not 0.

f. Let $\hat{\pi}_Y$ denote the least squares coefficient obtained from regressing $Y$ on $X_2$ with associated standard error estimator $s_{\hat{\pi}_Y}$ and t-statistic for testing $\pi_Y = 0$ of $t_Y = \hat{\pi}_Y / s_{\hat{\pi}_Y}$. Similarly, let $\hat{\pi}_X$ denote the least squares coefficient obtained from regressing $X_1$ on $X_2$ with associated standard error estimator $s_{\hat{\pi}_X}$ and t-statistic for testing $\pi_X = 0$ of $t_X = \hat{\pi}_X / s_{\hat{\pi}_X}$. Define a new estimator $\bar{\beta}_1 = 1(|t_Y| > c_n \text{ or } |t_X| > c_n)\hat{\beta}_1 + 1(|t_Y| \leq c_n \text{ and } |t_X| \leq c_n)\check{\beta}_1$. Show that $\sqrt{n}(\bar{\beta}_1 - \beta_1) \to_d N(0, V)$ when $\beta_{2,n} = 0$, $\beta_{2,n} = \delta$, or $\beta_{2,n} = a_n b/\sqrt{n}$, regardless of the value of $E[X_1 X_2]$ (and the sequence $a_n$). Suggest a consistent estimator of $V$.

g. Note that the moment condition underlying the definition of $\bar{\beta}_1$ is $E[((Y - E[Y \mid X_2]) - (X_1 - E[X_1 \mid X_2])\beta_1)(X_1 - E[X_1 \mid X_2])] = 0$, where $E[Y \mid X_2] = \pi_Y X_2$ and $E[X_1 \mid X_2] = \pi_X X_2$. Show that this moment condition is satisfied at the true values of $\beta_1$, $\pi_Y$, and $\pi_X$. Show that this moment condition has the orthogonality property discussed in Section 7 of the notes in that the derivative with respect to the nuisance parameters $\pi_Y$ and $\pi_X$ evaluated at the true parameter values is 0.

h. Design a simulation experiment to illustrate the potential consequences of the lack of uniformity of the estimator $\tilde{\beta}_1$ on inference. Specifically, you should be able to design a simulation experiment where the distribution of $\tilde{\beta}_1$ is strongly bimodal and the size of tests based on $t_{\tilde{\beta}_1}$ is far from the nominal level. Show the robustness of $\bar{\beta}_1$ within this design in that the normal approximation provides a sensible approximation to the distribution of $\bar{\beta}_1$ across simulation replications and tests based on the t-statistic formed using $\bar{\beta}_1$ and the suggested estimator of $V$ from part f. have approximately correct size. (A starting-point sketch follows this problem.) [A test is not uniform with respect to a class of models if there are sequences of models within this class that lead to the test being size distorted even in large samples. Uniformity of inference with respect to sensible sequences of models is very important in practice as well-designed sequences are much better able to capture actual finite-sample performance of estimators.]
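As a starting point for part h., the sketch below simulates the design with $\beta_{2,n}$ local to zero, so that the pretest retains $X_2$ in only some replications. All parameter values here are illustrative assumptions chosen so that the selection event is genuinely random; a histogram of the stored draws should then be strongly bimodal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 5000
beta1, pi_x, sigma, kappa = 1.0, 0.9, 1.0, 1.0
beta2 = 4.5 / np.sqrt(n)         # local-to-zero coefficient on X_2
c_n = np.sqrt(2.0 * np.log(n))   # pretest threshold of order sqrt(2 log n)

tilde = np.empty(reps)
for r in range(reps):
    x2 = rng.standard_normal(n)
    x1 = pi_x * x2 + kappa * rng.standard_normal(n)
    y = beta1 * x1 + beta2 * x2 + sigma * rng.standard_normal(n)
    # Long regression of Y on (X_1, X_2) with conventional standard errors.
    X = np.column_stack([x1, x2])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)
    resid = y - X @ b
    se = np.sqrt(resid @ resid / (n - 2) * np.diag(XtX_inv))
    # Short regression of Y on X_1 alone (the model has no intercept).
    b_check = (x1 @ y) / (x1 @ x1)
    # Pretest estimator: keep X_2 only when its t-statistic exceeds c_n.
    tilde[r] = b[0] if abs(b[1] / se[1]) > c_n else b_check
# Quantiles of the centered, scaled draws; plot a histogram to see bimodality.
print(np.quantile(np.sqrt(n) * (tilde - beta1), [0.05, 0.25, 0.5, 0.75, 0.95]))
```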

4. Recall that a doubly robust estimator of an average treatment effect is given by

$\widehat{ATE}_{\text{robust}} = \frac{1}{n}\sum_{i=1}^n \left[ \frac{D_i(Y_i - \hat{g}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - D_i)(Y_i - \hat{g}_0(X_i))}{1 - \hat{e}(X_i)} + \hat{g}_1(X_i) - \hat{g}_0(X_i) \right] = \frac{1}{n}\sum_{i=1}^n \hat{\psi}_i.$

Belloni, Chernozhukov, and Hansen (2014) show that $\sqrt{n}(\widehat{ATE}_{\text{robust}} - ATE) \to_d N(0, V)$ and that $\hat{V} = \frac{1}{n}\sum_{i=1}^n (\hat{\psi}_i - \widehat{ATE}_{\text{robust}})^2 \to_p V$ when $\ell_1$-penalized estimation is used to form $\hat{g}_1(\cdot)$, $\hat{g}_0(\cdot)$, and $\hat{e}(\cdot)$ under regularity conditions including having iid data and the assumption that these functions are approximately sparse. By approximately sparse, we mean that $g_1(X_i) = X_i'\beta_1 + r_1$, $g_0(X_i) = X_i'\beta_0 + r_0$, and $e(X_i) = \Lambda(X_i'\gamma) + r_e$ with $\max\{\|\beta_1\|_0, \|\beta_0\|_0, \|\gamma\|_0\} \leq s$, where $r_1$, $r_0$, and $r_e$ are approximation errors that satisfy $\max\{E[r_1^2], E[r_0^2], E[r_e^2]\} = O(s/n)$ and $s^2\log(p)^3/n \to 0$.

Consider the data in restatw.dat, which contains the data used in the 401(k) example in the first two lectures. Use e401 (eligibility for a 401(k) plan) as the treatment variable and net_tfa (net total financial assets) as the dependent variable. The argument for exogeneity of a 401(k) plan relies on conditioning on characteristics that might be associated with a person's decision to take a job and saving preferences. Potential control variables are age, inc (income), fsize (family size), educ (years of education), marr (marital status), male, twoearn (part of a two-earner household), db (has a defined benefit pension), and pira (has an IRA).
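Given fitted values for $g_1$, $g_0$, and $e$, the point estimate and its standard error follow directly from the display above. A minimal sketch (the function and argument names are mine):

```python
import numpy as np

def ate_doubly_robust(y, d, g1_hat, g0_hat, e_hat):
    """Doubly robust ATE built from the psi_i in the display above, with
    plug-in standard error sqrt(V_hat / n), V_hat = mean((psi_i - ATE)^2)."""
    psi = (d * (y - g1_hat) / e_hat
           - (1.0 - d) * (y - g0_hat) / (1.0 - e_hat)
           + g1_hat - g0_hat)
    ate_hat = psi.mean()
    se = np.sqrt(np.mean((psi - ate_hat) ** 2) / len(y))
    return ate_hat, se
```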

a. Construct an approximating dictionary to use in $\ell_1$-penalized estimation using the control variables above. Carefully explain your choices in deciding what functions to put in the dictionary. Explain intuitively what the sparsity assumption requires in terms of this example and within the context of the dictionary you have chosen. Does it seem plausible that this assumption would be satisfied?

b. Estimate $g_1(\cdot)$ and $g_0(\cdot)$ using lasso with penalty weights that are appropriate under heteroscedasticity and penalty parameter $\lambda = 2.2\sqrt{n}\,\Phi^{-1}(1 - (.1/\log(n))/(2p))$, where $p$ is the number of elements in your dictionary and $n$ is the appropriate sample size. (Note that $n$ will not be the same for estimating $g_0$ and $g_1$. Note also that the $\lambda$ given is appropriate for solving

$\hat{\beta}_P \in \arg\min_b \frac{1}{n}\sum_{i=1}^n (y_i - x_i'b)^2 + \frac{\lambda}{n}\sum_{j=1}^p \hat{\phi}_j |b_j|.$

Some lasso implementations use different scalings, such as $\hat{\beta}_P \in \arg\min_b \frac{1}{2n}\sum_{i=1}^n (y_i - x_i'b)^2 + \lambda\sum_{j=1}^p \hat{\phi}_j |b_j|$, which would require altering the penalty parameter so that you are solving the same problem; see the sketch after part c.) Which variables are selected to approximate each function? Do these variables make sense? Explain. Should we conclude that the selected variables are the true variables in the sense that we have captured the correct model for $E[Y_1 \mid X]$ and $E[Y_0 \mid X]$? Explain.

c. Estimate $e(\cdot)$ using $\ell_1$-penalized logistic regression with $\lambda = 1.1\sqrt{n}\,\Phi^{-1}(1 - (.1/\log(n))/(2p))$, where $p$ is the number of elements in your dictionary and $n$ is the appropriate sample size. (Be careful about scaling again. This $\lambda$ is appropriate for solving

$\hat{\gamma} \in \arg\min_g -\frac{1}{n}\sum_{i=1}^n \text{log-likelihood}(y_i, x_i, g) + \frac{\lambda}{n}\sum_{j=1}^p |g_j|.$

If the $\ell_1$-penalized logistic regression function you are using uses a different scaling, you will need to adjust $\lambda$ appropriately.) Which variables are selected? Do these variables make sense? Explain. Should we conclude that the selected variables are the true variables in the sense that we have captured the correct model for $E[D \mid X]$? Explain.
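The penalty formulas in parts b.-c. and the rescaling warning can be made concrete as follows. The conversion shown targets scikit-learn's Lasso objective, $\frac{1}{2n}\sum_i (y_i - x_i'b)^2 + \alpha \|b\|_1$; the sample sizes are placeholders, and the penalty loadings $\hat{\phi}_j$ must be applied separately (for instance by rescaling columns), since scikit-learn does not accept per-coefficient weights.

```python
import numpy as np
from scipy.stats import norm

def bch_lambda(n, p, c):
    """lambda = c * sqrt(n) * Phi^{-1}(1 - (.1/log(n)) / (2p))."""
    return c * np.sqrt(n) * norm.ppf(1.0 - (0.1 / np.log(n)) / (2.0 * p))

n_treated, n_full, p = 4000, 9915, 40    # placeholder sizes; use actual counts
lam_g1 = bch_lambda(n_treated, p, c=2.2)  # part b., (1/n)-scaled lasso objective
lam_e = bch_lambda(n_full, p, c=1.1)      # part c., penalized logistic regression

# Multiplying the (1/n)-scaled lasso objective by 1/2 shows that sklearn's
# alpha equals lambda / (2n) when all penalty loadings are one.
alpha_g1 = lam_g1 / (2.0 * n_treated)
```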

d. Take the selected variables for estimating $g_1$ and estimate $\hat{g}_1(X)$ by unpenalized least squares regression of $Y$ on these selected variables in the subsample of observations with $D = 1$. Take the selected variables for estimating $g_0$ and estimate $\hat{g}_0(X)$ by unpenalized least squares regression of $Y$ on these selected variables in the subsample of observations with $D = 0$. Take the selected variables for estimating $e(X)$ and estimate $\hat{e}(X)$. Form fitted values for $g_1$, $g_0$, and $e$ for each observation in the data. Using these fitted values, obtain $\widehat{ATE}_{\text{robust}}$ and estimate its standard error. (A post-lasso sketch follows part g.)

e. Estimate $g_1(\cdot)$ and $g_0(\cdot)$ using lasso with penalty weights that are appropriate under homoscedasticity and with penalty parameter chosen by cross-validation. Which variables are selected to approximate each function? Do these variables make sense? Explain. Do these results differ appreciably from those in part b.? Should we conclude that the selected variables are the true variables in the sense that we have captured the correct model for $E[Y_1 \mid X]$ and $E[Y_0 \mid X]$? Explain.

f. Estimate $e(\cdot)$ using $\ell_1$-penalized logistic regression with $\lambda$ chosen by cross-validation. Which variables are selected? Do these variables make sense? Explain. Do these results differ appreciably from those in part c.? Should we conclude that the selected variables are the true variables in the sense that we have captured the correct model for $E[D \mid X]$? Explain.

g. Take the estimated models from e. and f. for $g_0$, $g_1$, and $e$ (i.e. just use the coefficient estimates that come directly out of the estimation) to form fitted values for $g_1$, $g_0$, and $e$ for each observation in the data. Using these fitted values, obtain $\widehat{ATE}_{\text{robust}}$ and estimate its standard error. Are these results appreciably different from those obtained in part d.? Explain the significance of the similarity or difference.
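For part d., the post-lasso step (select with an $\ell_1$ penalty on a subsample, refit by unpenalized least squares, predict for all observations) might be sketched as below. The alpha argument is on scikit-learn's scaling, heteroscedasticity-appropriate penalty loadings are omitted for brevity, and the no-selection fallback is my own choice.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def post_lasso_fitted(X, y, subsample, alpha):
    """Lasso on the given subsample for variable selection, unpenalized OLS
    refit on the selected columns, and fitted values for every observation."""
    sel = Lasso(alpha=alpha).fit(X[subsample], y[subsample]).coef_ != 0
    if not sel.any():  # nothing selected: fall back to the subsample mean
        return np.full(X.shape[0], y[subsample].mean())
    ols = LinearRegression().fit(X[subsample][:, sel], y[subsample])
    return ols.predict(X[:, sel])

# For example, with d the treatment indicator and alphas from the sketch above:
# g1_hat = post_lasso_fitted(X, y, d == 1, alpha_g1)
# g0_hat = post_lasso_fitted(X, y, d == 0, alpha_g0)
# These, with a post-selection estimate of e(X), feed ate_doubly_robust()
# from the sketch after the problem 4 preamble.
```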