
14.385 Nonlinear Econometrics, Fall 2007. Lecture 2. Theory: Consistency for Extremum Estimators. Modeling: Probit, Logit, and Other Links.

Example: Binary Choice Models. The latent outcome is defined by the equation

y*_i = x_i'β − ε_i,  ε_i ∼ F(·).

We observe y_i = 1(y*_i ≥ 0). The cdf F is completely known. Then

P(y_i = 1 | x_i) = P(ε_i ≤ x_i'β | x_i) = F(x_i'β).

We can then estimate β using the log-likelihood function

Q̂(β) = E_n[y_i ln F(x_i'β) + (1 − y_i) ln(1 − F(x_i'β))].

The resulting MLE is CAN and efficient under regularity conditions. The story is that a consumer has two choices: the utility from one choice is y*_i = x_i'β − ε_i, and the utility from the other is normalized to 0. We need to estimate the

parameters of the latent utility based on the observed choice frequencies.

Estimands: The key parameters to estimate are P[y_i = 1 | x_i] and partial effects of the kind

∂P[y_i = 1 | x_i]/∂x_ij = f(x_i'β)·β_j,  where f = F'.

These parameters are functionals of the parameter β and the link F.

Choices of F:
Logit: F(t) = Λ(t) = exp(t)/(1 + exp(t)).
Probit: F(t) = Φ(t), the standard normal cdf.
Cauchy: F(t) = C(t) = 1/2 + (1/π) arctan(t), the Cauchy cdf.
Gosset: F(t) = T(t, v), the cdf of a t-variable with v degrees of freedom.

The choice of F(·) can be important, especially in the tails. The prediction of small and large

probabilities by different models may differ substantially. For example, the Probit and Cauchit links, Φ(t) and C(t), have drastically different tail behavior and give different predictions for the same value of the index t. See Figure 1 for a theoretical example and Figure 2 for an empirical example. In the housing example, y_i records whether a person owns a house or not, and x_i consists of an intercept and the person's income.
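The tail differences across links can be made concrete numerically. The following sketch is our own illustration, not part of the original notes (the function names are ours); it evaluates the logit, probit, and cauchit cdfs at a few negative index values:

```python
from math import erf, atan, exp, pi

def logit_cdf(t):
    # Lambda(t) = exp(t) / (1 + exp(t))
    return 1.0 / (1.0 + exp(-t))

def probit_cdf(t):
    # Phi(t), the standard normal cdf, via the error function
    return 0.5 * (1.0 + erf(t / 2 ** 0.5))

def cauchit_cdf(t):
    # C(t) = 1/2 + arctan(t) / pi, the Cauchy cdf
    return 0.5 + atan(t) / pi

# In the left tail the probit assigns far smaller probabilities than the
# logit, which in turn is much lighter-tailed than the cauchit.
for t in (-2.0, -3.0, -4.0):
    print(f"t={t}: logit={logit_cdf(t):.2e}, "
          f"probit={probit_cdf(t):.2e}, cauchit={cauchit_cdf(t):.2e}")
```

At t = −4, for example, the probit probability is on the order of 10⁻⁵ while the cauchit probability is near 0.08, so the same index can yield very different predicted probabilities.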

[Figure 1: P–P plots for various links F (Normal, Cauchy, Logit); x-axis: Normal probability, y-axis: probability.]

[Figure 2: Predicted probabilities of owning a house under the Normal, Cauchy, Logit, and Linear links, plotted against the Normal prediction.]

The choice of F(·) can be less important when using flexible functional forms. Indeed, for any F we can approximate

P[y_i = 1 | x] ≈ F[P(x)'β],

where P(x) is a collection of approximating functions, for example splines, powers, or other

series, as we know from basic approximation theory. This point is illustrated in the following figure, which deals with the earlier housing example but uses a flexible functional form, with P(x) generated as a cubic spline with ten degrees of freedom. Flexibility is great for this reason, but of course it has its own price: additional parameters lead to increased estimation variance.

[Figure: Flexibly predicted probabilities of owning a house under the Normal, Cauchy, Logit, and Linear links, plotted against the Normal prediction.]

Discussion: Choosing the right model is a hard and very important problem in statistical analysis. Using flexible links, e.g. the t-link vs. the probit link, comes at the cost of additional parameters. Using flexible expansions inside the links also requires additional parameters. Flexibility reduces the approximation error (bias),

but typically increases estimation variance. Thus an optimal choice has to balance these terms. A useful device for choosing the best-performing model is cross-validation.

Reading: A very nice reference is R. Koenker and J. Yoon (2006), who provide a systematic treatment of links beyond the logit and probit, with an application to propensity score matching. The estimates plotted in the figures were produced using the R language's glm package. The Cauchy, Gosset, and other links for this package were implemented by Koenker and Yoon (2006). (http://www.econ.uiuc.edu/~roger/research/links/links.html)

References: Koenker, R. and J. Yoon (2006), Parametric Links for Binary Response Models.
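The bias–variance trade-off from flexible expansions inside a link can be sketched in a small simulation. The example below is our own (it is not from the notes, and the data-generating index 1.5·sin(2x) is an arbitrary illustrative choice): we simulate from a logit with a nonlinear index, fit a logit MLE by Newton's method using a linear index versus a cubic-polynomial index, and compare average in-sample log-likelihoods.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-2.0, 2.0, n)
# True model: logit link with a nonlinear index, so a linear-index logit
# is misspecified while a cubic-polynomial index can approximate it.
p_true = 1.0 / (1.0 + np.exp(-1.5 * np.sin(2.0 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

def fit_logit(X, y, steps=25):
    """Logit MLE via Newton-Raphson on the average log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p) / len(y)                    # score
        hess = -(X * (p * (1 - p))[:, None]).T @ X / len(y)
        beta -= np.linalg.solve(hess, grad)              # Newton step
    return beta

def avg_loglik(X, y, beta):
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X_lin = np.column_stack([np.ones(n), x])                 # intercept + x
X_cub = np.column_stack([x ** k for k in range(4)])      # 1, x, x^2, x^3

ll_lin = avg_loglik(X_lin, y, fit_logit(X_lin, y))
ll_cub = avg_loglik(X_cub, y, fit_logit(X_cub, y))
print(ll_lin, ll_cub)   # the flexible index fits at least as well in-sample
```

Because the linear specification is nested in the cubic one, the flexible fit cannot have a lower in-sample log-likelihood; out of sample the ranking could reverse once the extra estimation variance bites, which is exactly what cross-validation is meant to detect.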

1. Extremum Consistency

The extremum estimator is

θ̂ = argmin_{θ∈Θ} Q̂(θ).

As we saw in Lecture 1, the extremum estimator encompasses nearly all estimators (MLE, M, GMM, MD). Thus, consistency of all these estimators will follow from consistency of extremum estimators.

Theorem 1 (Extremum Consistency Theorem). If
i) (Identification) Q(θ) is uniquely minimized at the true parameter value θ₀;
ii) (Compactness) Θ is compact;
iii) (Continuity) Q(·) is continuous;
iv) (Uniform convergence) sup_{θ∈Θ} |Q̂(θ) − Q(θ)| →p 0;
then θ̂ →p θ₀.

Intuition: Just draw a picture that describes the theorem.

Proof: The proof has two steps: 1) we show Q(θ̂) →p Q(θ₀), using the assumed uniform convergence of Q̂ to Q; and 2) we show that this implies that θ̂ must be close to θ₀, using continuity of Q and the fact that θ₀ is the unique minimizer.

Step 1. By uniform convergence,

Q̂(θ̂) − Q(θ̂) →p 0 and Q̂(θ₀) − Q(θ₀) →p 0.

Also, since θ̂ minimizes Q̂ and θ₀ minimizes Q,

Q(θ₀) ≤ Q(θ̂) and Q̂(θ̂) ≤ Q̂(θ₀).

Therefore

Q(θ₀) ≤ Q(θ̂) = Q̂(θ̂) + [Q(θ̂) − Q̂(θ̂)]
       ≤ Q̂(θ₀) + [Q(θ̂) − Q̂(θ̂)]
       = Q(θ₀) + [Q̂(θ₀) − Q(θ₀) + Q(θ̂) − Q̂(θ̂)],

where the bracketed term is o_p(1), implying

Q(θ₀) ≤ Q(θ̂) ≤ Q(θ₀) + o_p(1).

It follows that Q(θ̂) →p Q(θ₀).

Step 2. By compactness of Θ and continuity of Q(θ), for any open subset N of Θ containing θ₀, we have

inf_{θ∉N} Q(θ) > Q(θ₀).

Indeed, inf_{θ∉N} Q(θ) = Q(θ*) for some θ* ∈ Θ, and by identification Q(θ*) > Q(θ₀). But since Q(θ̂) →p Q(θ₀), we have

Q(θ̂) < inf_{θ∉N} Q(θ)

with probability approaching one, and hence θ̂ ∈ N with probability approaching one. Q.E.D.

Discussion: 1. The first and big step in verifying consistency is that the limit Q(θ) is minimized at the

parameter value we want to estimate. Q(θ) is always minimized at something under the stated conditions, but it may not be the value we want. Once we have verified this part, we can proceed to verify the other conditions. Because Q̂(θ) is either an average (in M-estimation) or a transformation of an average (in GMM), we can verify the uniform convergence by applying one of the uniform laws of large numbers (ULLN). The starred comments below are technical in nature.

2. The theorem can easily be generalized in several directions: (a) continuity of the limit objective function can be replaced by lower semi-continuity, and (b) uniform convergence can be replaced by, for example, one-sided uniform convergence:

sup_{θ∈Θ} [Q(θ) − Q̂(θ)] ≤ o_p(1) and Q̂(θ₀) − Q(θ₀) →p 0.

Van der Vaart (1998) presents several generalizations. Knight (1998), "Epi-convergence and Stochastic Equi-semicontinuity," is another good reference.

3. Measurability conditions are subsumed here in the general definition of →p (convergence in outer probability). The element X_n = sup_{θ∈Θ} |Q̂(θ) − Q(θ)| converges in outer probability to 0 if, for any ɛ > 0, there exists a measurable event Ω_ɛ which contains the (not necessarily measurable) event {X_n > ɛ} and satisfies Pr(Ω_ɛ) → 0. We also denote this statement by Pr*(X_n > ɛ) → 0. This approach, due to Hoffmann-Jørgensen, allows one to bypass measurability issues and focus more directly on the problem. Van der Vaart (1998) provides a very good introduction to the Hoffmann-Jørgensen approach.
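Theorem 1 can be illustrated in a minimal simulation (our own sketch, not from the notes; the Gaussian design and grid size are arbitrary choices). Here Q̂(θ) = E_n|z_i − θ| is a median/LAD objective minimized over a discretized compact Θ; its limit E|z − θ| is uniquely minimized at the median θ₀, so θ̂ should approach θ₀ as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0                      # true parameter: the median of N(2, 1)
grid = np.linspace(-5.0, 5.0, 401)  # discretized compact parameter set Theta

def theta_hat(n):
    z = rng.normal(theta0, 1.0, n)
    # Sample objective Q_hat(theta) = E_n |z_i - theta| (M-estimation);
    # its limit E|z - theta| is uniquely minimized at the median theta0.
    Qhat = np.array([np.mean(np.abs(z - t)) for t in grid])
    return grid[np.argmin(Qhat)]

estimates = {n: theta_hat(n) for n in (50, 1000, 20000)}
print(estimates)   # estimates concentrate around theta0 = 2 as n grows
```

The remaining error at the largest n is dominated by the grid spacing (0.025 here), which plays the role of a numerical-optimization tolerance rather than statistical noise.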

3. Uniform Convergence and Uniform Laws of Large Numbers

Here we discuss how to verify the uniform convergence of the sample objective function Q̂ to the limit objective function Q. It is important to emphasize that pointwise convergence, that is, Q̂(θ) →p Q(θ) for each θ, does not imply uniform convergence, sup_{θ∈Θ} |Q̂(θ) − Q(θ)| →p 0. It is also easy to see that pointwise convergence is not sufficient for consistency of argmins. (Just draw a moving-spike picture.)

There are two approaches:

a. Establish uniform convergence results directly, using ready uniform laws of large numbers that rely on the fact that many objective functions are averages or functions of averages.

b. Convert familiar pointwise convergence in probability and pointwise laws of large numbers into uniform convergence in probability and uniform laws of large numbers. This conversion is achieved by means of stochastic equicontinuity.

a. Direct approach. Often Q̂(θ) is equal to a sample average (M-estimation) or a quadratic form in a sample average (GMM). The following result is useful for showing conditions iii) and iv) of Theorem 1, because it gives conditions for continuity of expectations and uniform convergence of sample averages to expectations.

Lemma 1 (Uniform Law of Large Numbers). Suppose Q̂(θ) = E_n[q(z_i, θ)]. Assume that the data

z_1, ..., z_n are stationary and strongly mixing, and that
i) q(z, θ) is continuous at each θ with probability one;
ii) Θ is compact; and
iii) E[sup_{θ∈Θ} |q(z, θ)|] < ∞.
Then Q(θ) = E[q(z, θ)] is continuous on Θ and sup_{θ∈Θ} |Q̂(θ) − Q(θ)| →p 0.

Condition i) allows q(z, θ) to be discontinuous (with probability zero). For instance, q(z, θ) = 1(x'θ > 0) satisfies this condition at any θ where Pr(x'θ = 0) = 0.

Theoretical Exercise. Prove the ULLN lemma. The lemmas that follow below may be helpful for this.

b. Equicontinuity approach. First of all, we require that the sample criterion function converges to the limit criterion function pointwise, that is, for each θ,

Q̂(θ) →p Q(θ). We also require that Q̂ has equicontinuous behavior. Let

ω̂(ɛ) := sup_{|θ−θ'|≤ɛ} |Q̂(θ) − Q̂(θ')|

be a measure of the oscillation of a function over small neighborhoods (balls of radius ɛ). This measure is called the modulus of continuity. We require that these oscillations vanish with the size of the neighborhoods, in large samples. Thus, we rule out moving spikes that can be consistent with pointwise convergence but destroy uniform convergence.

Lemma 2. For a compact parameter space Θ, suppose that the following conditions hold:
i) (pointwise or finite-dimensional convergence) Q̂(θ) →p Q(θ) for each θ ∈ Θ;
ii) (stochastic equicontinuity) for any η > 0 there are ɛ > 0 and n₀ sufficiently large such that for all n > n₀, Pr{ω̂(ɛ) > η} ≤ η.

Then sup_{θ∈Θ} |Q̂(θ) − Q(θ)| →p 0.

The last condition is a stochastic version of the Arzelà–Ascoli equicontinuity condition. It is also worth noting that this condition is equivalent to finite-dimensional approximability: in large samples the function Q̂(θ) can be approximated well by its values Q̂(θ_j), j = 1, ..., K, on a finite fixed grid. The uniform convergence then follows from the convergence in probability of Q̂(θ_j), j = 1, ..., K, to Q(θ_j), j = 1, ..., K.

Theoretical Exercise. Prove Lemma 2, using the comment above as a hint.

Simple sufficient conditions for stochastic equicontinuity, and thus uniform convergence, include the following.

Lemma 3 (Uniform Hölder Continuity). Suppose Q̂(θ) →p Q(θ) for each θ in a compact Θ. Then

the uniform convergence holds if the uniform Hölder condition holds: for some h > 0, uniformly for all θ, θ' ∈ Θ,

|Q̂(θ) − Q̂(θ')| ≤ B_n · |θ − θ'|^h,  B_n = O_p(1).

Lemma 4 (Convexity Lemma). Suppose Q̂(θ) →p Q(θ) for almost every θ ∈ Θ₀, where Θ₀ is an open convex set in R^k containing the compact parameter set Θ, and suppose that Q̂(θ) is convex on Θ₀. Then the uniform convergence holds over Θ. The limit Q(θ) is necessarily convex on Θ₀.

Theoretical Exercise. Prove Lemmas 3 and 4. Proving Lemma 3 is trivial, while proving Lemma 4 is not. However, the result is quite intuitive: a convex function cannot oscillate too much over a small neighborhood, provided this function is pointwise close to a continuous function. We can draw a picture to convince ourselves of this. Lemma 4 is a stochastic version of a result from convex analysis. Pollard (1991) provides a nice proof of this result.
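The moving-spike pathology that stochastic equicontinuity rules out can be made concrete. In this deterministic sketch (our own construction, not from the notes), Q̂_n equals the limit Q(θ) = θ² except for a downward spike of fixed depth 2 whose location moves toward 1/2 and whose width shrinks like 1/n. Then Q̂_n → Q pointwise at every θ, yet sup_θ |Q̂_n(θ) − Q(θ)| stays at 2 and the argmin stays near the spike rather than near θ₀ = 0:

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 100001)    # fine grid on Theta = [-1, 1]
Q = grid ** 2                            # limit objective, minimized at 0

def Q_hat(n):
    # Moving spike of depth 2 at 0.5 + 1/n with width 2/n: every fixed
    # theta is eventually missed by the spike, so Q_hat -> Q pointwise,
    # but the sup-gap never shrinks and the argmin tracks the spike.
    loc = 0.5 + 1.0 / n
    return grid ** 2 - 2.0 * np.maximum(0.0, 1.0 - n * np.abs(grid - loc))

for n in (10, 100, 1000):
    gap = np.max(np.abs(Q_hat(n) - Q))
    argmin = grid[np.argmin(Q_hat(n))]
    print(n, round(gap, 3), round(argmin, 4))
```

The modulus ω̂(ɛ) of this sequence does not vanish for any fixed ɛ, which is exactly the failure of condition ii) of Lemma 2; note also that Q̂_n is not convex, so the Convexity Lemma does not apply either.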

A good technical reference on uniform convergence is Andrews (1994).

4. Consistency of Extremum Estimators under Convexity

It is possible to relax the assumption of compactness if something is done to keep Q̂(θ) from turning back down towards its minimum, i.e., to prevent ill-posedness. One condition that has this effect, and has many applications, is convexity (for minimization; concavity for maximization). Convexity also facilitates efficient computation of estimators.

Theorem 2 (Consistency with Convexity). If Q̂(θ) is convex on an open convex set Θ ⊂ R^d, and
(i) Q(θ) is uniquely minimized at θ₀ ∈ Θ;
(ii) Q̂(θ) →p Q(θ) for almost every θ ∈ Θ;
then θ̂ →p θ₀.

Theoretical Exercise. Prove this theorem.

Another consequence of convexity is that the uniform convergence requirement is replaced by pointwise convergence. This is a consequence of the Convexity Lemma.
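The Convexity Lemma can be checked numerically in a simple case. In this sketch (our own, assuming a Gaussian location model), the convex sample objective Q̂(θ) = E_n(z_i − θ)² converges pointwise to Q(θ) = 1 + (θ₀ − θ)² by the law of large numbers, and, as the lemma predicts, the sup-gap over a compact subset of the domain vanishes as n grows:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 1.0
grid = np.linspace(-3.0, 3.0, 601)   # compact subset of the convex domain R

def sup_gap(n):
    z = rng.normal(theta0, 1.0, n)
    m1, m2 = z.mean(), (z ** 2).mean()
    # Convex sample objective Q_hat(theta) = E_n (z_i - theta)^2,
    # expanded in terms of the sufficient statistics m1, m2.
    Qhat = m2 - 2.0 * grid * m1 + grid ** 2
    # Limit objective Q(theta) = E (z - theta)^2 = 1 + (theta0 - theta)^2.
    Q = 1.0 + (theta0 - grid) ** 2
    return np.max(np.abs(Qhat - Q))

gaps = {n: sup_gap(n) for n in (100, 10000, 1000000)}
print(gaps)   # sup-gap over the compact set shrinks as n grows
```

Only pointwise laws of large numbers for m1 and m2 are used here; convexity upgrades them to uniform convergence on the compact set, which is the mechanism behind Theorem 2.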

References:

Pollard, D. (1991): "Asymptotics for Least Absolute Deviation Regression Estimators," Econometric Theory, 7(2), 186-199.

Andrews, D. W. K. (1994): "Empirical Process Methods in Econometrics," in Handbook of Econometrics, Vol. IV, 2247-2294, North-Holland, Amsterdam.