
NONLINEAR LEAST-SQUARES ESTIMATION

DAVID POLLARD AND PETER RADCHENKO

ABSTRACT. The paper uses empirical process techniques to study the asymptotics of the least-squares estimator for the fitting of a nonlinear regression function. By combining and extending ideas of Wu and Van de Geer, it establishes new consistency and central limit theorems that hold under only second moment assumptions on the errors. An application to a delicate example of Wu's illustrates the use of the new theorems, leading to a normal approximation to the least-squares estimator with unusual logarithmic rescalings.

1. INTRODUCTION

Consider the model where we observe $y_i$ for $i = 1, \ldots, n$ with

(1) $y_i = f_i(\theta) + u_i$ where $\theta \in \Theta$.

The unobserved $f_i$ can be random or deterministic functions. The unobserved errors $u_i$ are random with zero means and finite variances. The index set $\Theta$ might be infinite dimensional. Later in the paper it will prove convenient to also consider triangular arrays of observations.

Think of $f(\theta) = (f_1(\theta), \ldots, f_n(\theta))$ and $u = (u_1, \ldots, u_n)$ as points in $\mathbb{R}^n$. The model specifies a surface $\mathcal{M}_\Theta = \{f(\theta) : \theta \in \Theta\}$ in $\mathbb{R}^n$. The vector of observations

Date: Original version April 1995; revised March 2002; revised 2 December 2003.
2000 Mathematics Subject Classification. Primary 62E20. Secondary: 60F05, 62G08, 62G20.
Key words and phrases. Nonlinear least squares, empirical processes, subgaussian, consistency, central limit theorem.

$y = (y_1, \ldots, y_n)$ is a random point in $\mathbb{R}^n$. The least-squares estimator (LSE) $\hat\theta_n$ is defined to minimize the distance from $y$ to $\mathcal{M}_\Theta$,
$$\hat\theta_n = \operatorname{argmin}_{\theta \in \Theta} \|y - f(\theta)\|^2,$$
where $\|\cdot\|$ denotes the usual Euclidean norm on $\mathbb{R}^n$.

Many authors have considered the behavior of $\hat\theta_n$ as $n \to \infty$ when the $y_i$ are generated by the model for a fixed $\theta_0$ in $\Theta$. When the $f_i$ are deterministic, it is natural to express assertions about convergence of $\hat\theta_n$ in terms of the $n$-dimensional Euclidean distance
$$\kappa_n(\theta_1, \theta_2) := \|f(\theta_1) - f(\theta_2)\|.$$
For example, Jennrich (1969) took $\Theta$ to be a compact subset of $\mathbb{R}^p$, the errors $\{u_i\}$ to be iid with zero mean and finite variance, and the $f_i$ to be continuous functions of $\theta$. He proved strong consistency of the least-squares estimator under the assumption that $n^{-1}\kappa_n(\theta_1, \theta_2)^2$ converges uniformly to a continuous function that is zero if and only if $\theta_1 = \theta_2$. He also gave conditions for asymptotic normality.

Under similar assumptions, Wu (1981, Theorem 1) proved that existence of a consistent estimator for $\theta_0$ implies

(2) $\kappa_n(\theta) := \kappa_n(\theta, \theta_0) \to \infty$ at each $\theta \neq \theta_0$.

If $\Theta$ is finite, the divergence (2) is also a sufficient condition for the existence of a consistent estimator (Wu 1981, Theorem 2). His main consistency result (his Theorem 3) may be reexpressed as a general convergence assertion.

Theorem 1. Suppose the $\{f_i\}$ are deterministic functions indexed by a subset $\Theta$ of $\mathbb{R}^p$. Suppose also that $\sup_i \operatorname{var}(u_i) < \infty$ and $\kappa_n(\theta) \to \infty$ at each $\theta \neq \theta_0$. Let $S$ be a bounded subset of $\Theta \setminus \{\theta_0\}$ and let $R_n := \inf_{\theta \in S} \kappa_n(\theta)$. Suppose there exist constants $\{L_i\}$ such that
(i) $\sup_{\theta \in S} |f_i(\theta) - f_i(\theta_0)| \le L_i$ for each $i$;
(ii) $|f_i(\theta_1) - f_i(\theta_2)| \le L_i \|\theta_1 - \theta_2\|$ for all $\theta_1, \theta_2 \in S$;
(iii) $\sum_{i \le n} L_i^2 = O(R_n^\alpha)$ for some $\alpha < 4$.
Then $P\{\hat\theta_n \notin S \text{ eventually}\} = 1$.
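As a concrete illustration of the definitions above (our addition, not part of the original argument), the following Python sketch computes the LSE for the model $f_i(\theta) = \lambda i^{-\mu}$ studied later in the paper, by brute-force minimization of $\|y - f(\theta)\|^2$ over a finite grid. The sample size, noise level, and grid are arbitrary illustrative choices.

```python
import random

def f(i, lam, mu):
    # regression function of model (3): f_i(theta) = lambda * i**(-mu)
    return lam * i ** (-mu)

def rss(y, lam, mu):
    # residual sum of squares ||y - f(theta)||^2
    return sum((y[i - 1] - f(i, lam, mu)) ** 2 for i in range(1, len(y) + 1))

def lse_grid(y, lams, mus):
    # brute-force LSE over a finite grid of (lambda, mu) values
    return min(((lam, mu) for lam in lams for mu in mus),
               key=lambda th: rss(y, *th))

random.seed(0)
n, theta0 = 200, (1.0, 0.25)
y = [f(i, *theta0) + random.gauss(0, 0.05) for i in range(1, n + 1)]
lams = [k / 20 for k in range(10, 31)]   # grid 0.50, 0.55, ..., 1.50
mus = [k / 20 for k in range(0, 11)]     # grid 0.00, 0.05, ..., 0.50
theta_hat = lse_grid(y, lams, mus)
```

Since the true parameter lies on the grid, the fitted value can do no worse than $\theta_0$ in residual sum of squares.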

Remark. Assumption (i) implies $\sum_{i \le n} L_i^2 \ge \kappa_n(\theta)^2$ for each $\theta$ in $S$, which forces $R_n \to \infty$. If $\Theta$ is compact and if for each $\theta \neq \theta_0$ there is a neighborhood $S = S_\theta$ satisfying the conditions of the Theorem, then $\hat\theta_n \to \theta_0$ almost surely.

Wu's paper was the starting point for several authors. For example, both Lai (1994) and Skouras (2000) generalized Wu's consistency results by taking the functions $f_i(\theta) = f_i(\theta, \omega)$ as random processes indexed by $\theta$. They took the $\{u_i\}$ as a martingale difference sequence, with $\{f_i\}$ a predictable sequence of functions with respect to a filtration $\{\mathcal{F}_i\}$.

Another line of development is typified by the work of Van de Geer (1990) and Van de Geer and Wegkamp (1996). They took $f_i(\theta) = f(x_i, \theta)$, where $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ is a set of deterministic functions and the $(x_i, u_i)$ are iid pairs. In fact they identified $\Theta$ with the index set $\mathcal{F}$. Under a stronger assumption about the errors, they established rates of convergence of $\kappa_n(\hat\theta_n)$ in terms of $L^2$ entropy conditions on $\mathcal{F}$, using empirical process methods that were developed after Wu's work. The stronger assumption was that the errors are uniformly subgaussian. In general, we say that a random variable $W$ has a subgaussian distribution if there exists some finite $\tau$ such that $P\exp(tW) \le \exp(\tfrac{1}{2}\tau^2 t^2)$ for all $t \in \mathbb{R}$. We write $\tau(W)$ for the smallest such $\tau$. Van de Geer assumed that $\sup_i \tau(u_i) < \infty$.

Remark. Notice that we must have $PW = 0$ when $W$ is subgaussian, because the linear term in the expansion of $P\exp(tW)$ must vanish. When $PW = 0$, subgaussianity is equivalent to existence of a finite constant $\beta$ for which $P\{|W| \ge x\} \le 2\exp(-x^2/\beta^2)$ for all $x \ge 0$.

In our paper we try to bring together the two lines of development. Our main motivation for working on nonlinear least squares was an example presented by Wu (1981, page 507). He noted that his consistency theorem has difficulties with a simple model,

(3) $f_i(\theta) = \lambda i^{-\mu}$ for $\theta = (\lambda, \mu) \in \Theta$, a compact subset of $\mathbb{R} \times \mathbb{R}^+$.
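As a small numerical aside (our illustration, not from the paper): the simplest subgaussian variable is a Rademacher sign $\epsilon$, for which $P\exp(t\epsilon) = \cosh t \le \exp(t^2/2)$, so $\tau(\epsilon) \le 1$ in the notation above. A quick check of the moment generating function bound on a grid of $t$ values:

```python
import math

def mgf_rademacher(t):
    # P exp(t*eps) for eps = +/-1 with probability 1/2 each; equals cosh(t)
    return 0.5 * (math.exp(t) + math.exp(-t))

# verify the subgaussian bound P exp(tW) <= exp(tau^2 t^2 / 2) with tau = 1
for t in [k / 10 for k in range(-50, 51)]:
    assert mgf_rademacher(t) <= math.exp(0.5 * t * t) + 1e-12
```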

For example, condition (2) does not hold for $\theta_0 = (0, 0)$ at any $\theta$ with $\mu > 1/2$. When $\theta_0 = (\lambda_0, 1/2)$, Wu's method fails in a more subtle way, but Van de Geer's (1990) method would work if the errors satisfied the subgaussian assumption. In Section 4, under only second moment assumptions on the errors, we establish weak consistency and a central limit theorem.

The main idea behind all the proofs (ours, as well as those of Wu and Van de Geer) is quite simple. The LSE also minimizes the random function

(4) $G_n(\theta) := \|y - f(\theta)\|^2 - \|u\|^2 = \kappa_n(\theta)^2 - 2Z_n(\theta)$ where $Z_n(\theta) := u \cdot f(\theta) - u \cdot f(\theta_0)$.

In particular, $G_n(\hat\theta_n) \le G_n(\theta_0) = 0$, that is, $\tfrac{1}{2}\kappa_n(\hat\theta_n)^2 \le Z_n(\hat\theta_n)$. For every subset $S$ of $\Theta$,

(5) $P\{\hat\theta_n \in S\} \le P\{\exists\,\theta \in S : Z_n(\theta) \ge \tfrac{1}{2}\kappa_n(\theta)^2\} \le 4\,P\big(\sup_{\theta \in S} Z_n(\theta)\big)^2 \big/ \inf_{\theta \in S} \kappa_n(\theta)^4.$

The final bound calls for a maximal inequality for $Z_n$. Our methods for controlling $Z_n$ are similar in spirit to those of Van de Geer. Under her subgaussian assumption, for every class of real functions $\{g_i(\theta) : \theta \in \Theta\}$, the process

(6) $X(\theta) = \sum_{i \le n} u_i g_i(\theta)$

has subgaussian increments. Indeed, if $\tau(u_i) \le \tau$ for all $i$ then
$$\tau\big(X(\theta_1) - X(\theta_2)\big)^2 \le \sum_{i \le n} \tau(u_i)^2 \big(g_i(\theta_1) - g_i(\theta_2)\big)^2 \le \tau^2 \|g(\theta_1) - g(\theta_2)\|^2.$$
That is, the tails of $X(\theta_1) - X(\theta_2)$ are controlled by the $n$-dimensional Euclidean distance between the vectors $g(\theta_1)$ and $g(\theta_2)$. This property allowed her to invoke a chaining bound (similar to our Theorem 2) for the tail probabilities of $\sup_{\theta \in S} Z_n(\theta)$ for various annuli $S = \{\theta : R \le \kappa_n(\theta) < 2R\}$.

Under the weaker second moment assumption on the errors, we apply symmetrization arguments to transform to a problem involving a new process $Z_n^\circ(\theta)$ with conditionally subgaussian increments. We avoid Van de Geer's subgaussianity assumption at the cost of extra Lipschitz conditions on the $f_i(\theta)$, analogous to Assumption (ii) of Theorem 1,

which lets us invoke chaining bounds for conditional second moments of $\sup_{\theta \in S} Z_n^\circ(\theta)$ for various $S$.

In Section 3 we prove a new consistency theorem (Theorem 3) and a new central limit theorem (Theorem 4), generalizing Wu's Theorem 5 for nonlinear LSEs. More precisely, our consistency theorem corresponds to an explicit bound for $P\{\kappa_n(\hat\theta_n) \ge R\}$, but we state the result in a form that makes comparison with Theorem 1 easier. Our Theorem does not imply almost sure convergence, but our techniques could easily be adapted to that task. We regard the consistency as a preliminary to the next level of asymptotics and not as an end in itself. We describe the local asymptotic behavior with another approximation result, Theorem 4, which can easily be transformed into a central limit theorem under a variety of mild assumptions on the errors $\{u_i\}$. For example, in Section 4 we apply the Theorem to the model (3), to sharpen the consistency result at $\theta_0 = (1, 1/2)$ into the approximation

(7) $\big( l_n^{1/2}(\hat\lambda_n - 1),\; l_n^{3/2}(1 - 2\hat\mu_n) \big) = \sum_{i \le n} u_i \zeta_{i,n} + o_p(1)$

where $l_n := \log n$, $l_i := \log i$, and $\zeta_{i,n} = i^{-1/2} l_n^{-1/2}\big( 4 - 6\,l_i/l_n,\; -12 + 24\,l_i/l_n \big)$. The sum on the right-hand side of (7) is of order $O_p(1)$ when $\sup_i \operatorname{var}(u_i) < \infty$. If the $\{u_i\}$ are also identically distributed, the sum has a limiting multivariate normal distribution.

2. MAXIMAL INEQUALITIES

Assumption (ii) of Theorem 1 ensures that the increments $Z_n(\theta_1) - Z_n(\theta_2)$ are controlled by the ordinary Euclidean distance in $\Theta$; we allow for control by more general metrics. Wu invoked a maximal inequality for sums of random continuous processes, a result derived from a bound on the covering numbers for $\mathcal{M}_\Theta$ as a subset of $\mathbb{R}^n$ under the usual Euclidean distance; we work with covering numbers for other metrics.
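Before turning to the maximal inequalities, the algebraic identity in display (4) is easy to confirm numerically. The following sketch is our own illustration, with arbitrary choices of $f(\theta_0)$, $f(\theta)$, and errors; the identity $G_n(\theta) = \kappa_n(\theta)^2 - 2Z_n(\theta)$ holds exactly, up to floating-point rounding.

```python
import random

random.seed(1)
n = 50
f0 = [(i + 1) ** -0.5 for i in range(n)]        # f(theta_0), an arbitrary choice
f1 = [1.2 * (i + 1) ** -0.4 for i in range(n)]  # f(theta) at another parameter point
u = [random.gauss(0, 1) for _ in range(n)]      # errors
y = [a + e for a, e in zip(f0, u)]              # observations from model (1)

G = sum((yi - b) ** 2 for yi, b in zip(y, f1)) - sum(e * e for e in u)
kappa2 = sum((a - b) ** 2 for a, b in zip(f0, f1))   # kappa_n(theta)^2
Z = sum(e * (b - a) for e, a, b in zip(u, f0, f1))   # Z_n(theta) = u.(f(theta) - f(theta_0))
assert abs(G - (kappa2 - 2 * Z)) < 1e-9
```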

Definition 1. Let $(T, d)$ be a metric space. The covering number $N(\delta, S, d)$ is defined as the size of the smallest $\delta$-net for $S$; that is, the smallest $N$ for which there are points $t_1, \ldots, t_N$ in $T$ with $\min_i d(s, t_i) \le \delta$ for every $s$ in $S$.

Remark. The definition is the same for a pseudometric space, that is, a space where $d(\theta_1, \theta_2) = 0$ need not imply $\theta_1 = \theta_2$. In fact, all results in our paper that refer to metric spaces also apply to pseudometric spaces. The slight increase in generality is sometimes convenient when dealing with metrics defined by $L^p$ norms on functions.

Standard chaining arguments (see, for example, Pollard 1989) give maximal inequalities for processes with subgaussian increments controlled by a metric on the index set.

Theorem 2. Let $\{W_t : t \in T\}$ be a stochastic process, indexed by a metric space $(T, d)$, with subgaussian increments. Let $T_\delta$ be a $\delta$-net for $T$. Suppose:
(i) there is a constant $K$ such that $\tau(W_s - W_t) \le K d(s, t)$ for all $s, t \in T$;
(ii) $J(\delta) := \int_0^\delta \rho N(y, T, d)\,dy < \infty$, where $\rho N := \sqrt{1 + \log N}$.
Then there is a universal constant $c_1$ such that
$$P \sup_{t \in T} |W_t|^2 \le c_1^2 \Big( K J(\delta) + \rho N(\delta, T, d)\, \max_{s \in T_\delta} \tau(W_s) \Big)^2.$$

Remark. We should perhaps work with outer expectations because, in general, there is no guarantee that a supremum of uncountably many random variables is measurable. For concrete examples, such as the one discussed in Section 4, measurability can usually be established by routine separability arguments. Accordingly, we will ignore the issue in this paper.

Under the assumption that $\operatorname{var}(u_i) \le \sigma^2$, the $X$ process from (6) need not have subgaussian increments. However, it can be bounded in a stochastic sense by a symmetrized process $X^\circ(\theta) := \sum_{i \le n} \epsilon_i u_i g_i(\theta)$, where the $2n$ random variables $\epsilon_1, \ldots, \epsilon_n, u_1, \ldots, u_n$ are mutually independent with $P\{\epsilon_i = +1\} = 1/2 = P\{\epsilon_i = -1\}$. In fact, for each subset $S$ of the index set $\Theta$,

(8) $P \sup_{\theta \in S} |X(\theta)|^2 \le 4\, P \sup_{\theta \in S} |X^\circ(\theta)|^2.$
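To make Definition 1 concrete, here is a small sketch (our addition) that builds a $\delta$-net greedily for a discretized interval under the absolute-value metric. A greedy net is not minimal in general, so its size is only an upper bound for $N(\delta, S, d)$, of the expected order $O(1/\delta)$ for an interval.

```python
def covering_number_upper(points, delta, d):
    # greedy delta-net: scan the points, opening a new center whenever a point
    # is not within delta of an existing center
    centers = []
    for s in points:
        if all(d(s, t) > delta for t in centers):
            centers.append(s)
    return len(centers)  # size of *a* delta-net, hence an upper bound on N(delta, S, d)

S = [k / 100 for k in range(101)]   # the interval [0, 1], discretized
d = lambda a, b: abs(a - b)
N = covering_number_upper(S, 0.1, d)
```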

For a proof see, for example, van der Vaart and Wellner (1996, Lemma 2.3.1). The process $X^\circ$ has conditionally subgaussian increments, with

(9) $\tau_u\big(X^\circ(\theta_1) - X^\circ(\theta_2)\big)^2 \le \sum_{i \le n} u_i^2 \big(g_i(\theta_1) - g_i(\theta_2)\big)^2.$

The subscript $u$ indicates the conditioning on $u$.

Corollary 1. Let $S_\delta$ be a $\delta$-net for $S$ and let $X$ be as in (6). Suppose
(i) $Pu_i = 0$ and $\operatorname{var}(u_i) \le \sigma^2$ for $i = 1, \ldots, n$;
(ii) there is a metric $d$ for which $J(\delta) := \int_0^\delta \rho N(y, S, d)\,dy < \infty$;
(iii) there are constants $L_1, \ldots, L_n$ for which $|g_i(\theta_1) - g_i(\theta_2)| \le L_i d(\theta_1, \theta_2)$ for all $i$ and all $\theta_1, \theta_2 \in S$;
(iv) there are constants $b_1, \ldots, b_n$ for which $|g_i(\theta)| \le b_i$ for all $i$ and all $\theta$ in $S$.
Then there is a universal constant $c_2$ such that
$$P \sup_{\theta \in S} |X^\circ(\theta)|^2 \le c_2^2 \sigma^2 \big( L J(\delta) + B \rho N(\delta, S, d) \big)^2,$$
where $L^2 := \sum_i L_i^2$ and $B^2 := \sum_i b_i^2$.

Proof. From (9), $\tau_u\big(X^\circ(\theta_1) - X^\circ(\theta_2)\big) \le L_u d(\theta_1, \theta_2)$, where $L_u^2 := \sum_{i \le n} L_i^2 u_i^2$, and $\tau_u\big(X^\circ(\theta)\big) \le B_u$, where $B_u^2 := \sum_{i \le n} b_i^2 u_i^2$. Apply Theorem 2 conditionally to the process $X^\circ$ to bound $P_u \sup_{\theta \in S} |X^\circ(\theta)|^2$. Then invoke inequality (8), using the facts that $P L_u^2 \le \sigma^2 L^2$ and $P B_u^2 \le \sigma^2 B^2$.
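The conditional bound (9) is Hoeffding's observation that a Rademacher sum $\sum_i \epsilon_i a_i$ is subgaussian with squared parameter $\sum_i a_i^2$, since its moment generating function $\prod_i \cosh(t a_i)$ is dominated by $\exp(t^2 \sum_i a_i^2 / 2)$. A numerical check (our illustration; the vector $a$ is an arbitrary stand-in for the values $u_i(g_i(\theta_1) - g_i(\theta_2))$):

```python
import math

def mgf_rademacher_sum(a, t):
    # P exp(t * sum_i eps_i a_i) = prod_i cosh(t * a_i) for independent signs eps_i
    return math.prod(math.cosh(t * ai) for ai in a)

a = [0.5, -1.2, 2.0, 0.1]         # stands in for u_i * (g_i(theta1) - g_i(theta2))
tau2 = sum(ai * ai for ai in a)   # squared subgaussian parameter from (9)
for t in [k / 7 for k in range(-21, 22)]:
    assert mgf_rademacher_sum(a, t) <= math.exp(0.5 * tau2 * t * t) + 1e-12
```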

3. LIMIT THEOREMS

Inequality (5) and Corollary 1, with $g_i(\theta) = f_i(\theta) - f_i(\theta_0)$, give us some probabilistic control over $\hat\theta_n$.

Theorem 3. Let $S$ be a subset of $\Theta$ equipped with a pseudometric $d$. Let $\{L_i : i = 1, \ldots, n\}$, $\{b_i : i = 1, \ldots, n\}$, and $\delta$ be positive constants such that
(i) $|f_i(\theta_1) - f_i(\theta_2)| \le L_i d(\theta_1, \theta_2)$ for all $\theta_1, \theta_2 \in S$;
(ii) $|f_i(\theta) - f_i(\theta_0)| \le b_i$ for all $\theta \in S$;
(iii) $J(\delta) := \int_0^\delta \rho N(y, S, d)\,dy < \infty$.
Then
$$P\{\hat\theta_n \in S\} \le 4 c_2^2 \sigma^2 \big( B \rho N(\delta, S, d) + L J(\delta) \big)^2 \big/ R^4,$$
where $R := \inf\{\kappa_n(\theta) : \theta \in S\}$, $L^2 := \sum_i L_i^2$, and $B^2 := \sum_i b_i^2$.

The Theorem becomes more versatile in its application if we partition $S$ into a countable union of subsets $S_k$, each equipped with its own pseudometric and Lipschitz constants. We then have $P\{\hat\theta_n \in \cup_k S_k\}$ smaller than a sum over $k$ of bounds analogous to those in the Theorem. As shown in Section 4, this method works well for the Wu example if we take $S_k = \{\theta : R_k \le \kappa_n(\theta) < R_{k+1}\}$, for an $\{R_k\}$ sequence increasing geometrically.

A similar appeal to Corollary 1, with the $g_i(\theta)$ as partial derivatives of the $f_i(\theta)$ functions, gives us enough local control over $Z_n$ to go beyond consistency. To accommodate the application in Section 4, we change notation slightly by working with a triangular array: for each $n$,
$$y_{in} = f_{in}(\theta_0) + u_{in} \quad \text{for } i = 1, 2, \ldots, n,$$
where the $\{u_{in} : i = 1, \ldots, n\}$ are unobserved independent random variables with mean zero and variance bounded by $\sigma^2$.

Theorem 4. Suppose $\hat\theta_n \to \theta_0$ in probability, with $\theta_0$ an interior point of $\Theta$, a subset of $\mathbb{R}^p$. Suppose also:

(i) Each $f_{in}$ is continuously differentiable in a neighborhood $\mathcal{N}$ of $\theta_0$, with derivatives $D_{in}(\theta) = \partial f_{in}(\theta)/\partial\theta$.
(ii) $\gamma_n^2 := \sum_{i \le n} |D_{in}(\theta_0)|^2 \to \infty$ as $n \to \infty$.
(iii) There are constants $\{M_{in}\}$ with $\sum_{i \le n} M_{in}^2 = O(\gamma_n^2)$ and a metric $d$ on $\mathcal{N}$ for which $|D_{in}(\theta_1) - D_{in}(\theta_2)| \le M_{in} d(\theta_1, \theta_2)$ for $\theta_1, \theta_2 \in \mathcal{N}$.
(iv) The smallest eigenvalue of the matrix $V_n := \gamma_n^{-2} \sum_{i \le n} D_{in}(\theta_0) D_{in}(\theta_0)'$ is bounded away from zero for $n$ large enough.
(v) $\int_0^1 \rho N(y, \mathcal{N}, d)\,dy < \infty$.
(vi) $d(\theta, \theta_0) \to 0$ as $\theta \to \theta_0$.
Then $\kappa_n(\hat\theta_n) = O_p(1)$ and
$$\gamma_n(\hat\theta_n - \theta_0) = \sum_{i \le n} \xi_{i,n} u_{in} + o_p(1) = O_p(1),$$
where $\xi_{i,n} = \gamma_n^{-1} V_n^{-1} D_{in}(\theta_0)$.

Proof. Let $D$ be the $p \times n$ matrix with $i$th column $D_{in}(\theta_0)$, so that $\gamma_n^2 = \operatorname{trace}(DD')$ and $V_n = \gamma_n^{-2} DD'$. The main idea of the proof is to replace $f(\theta)$ by $f(\theta_0) + D'(\theta - \theta_0)$, thereby approximating $\hat\theta_n$ by the least-squares solution
$$\tilde\theta_n := \theta_0 + (DD')^{-1} D u = \operatorname{argmin}_{\theta \in \mathbb{R}^p} \| y - f(\theta_0) - D'(\theta - \theta_0) \|.$$
To simplify notation, assume, with no loss of generality, that $f(\theta_0) = 0$ and $\theta_0 = 0$. Also, drop extra $n$ subscripts when the meaning is clear. The assertion of the Theorem is that $\hat\theta_n = \tilde\theta_n + o_p(\gamma_n^{-1})$.

Without loss of generality, suppose the smallest eigenvalue of $V_n$ is larger than a fixed constant $c_0^2 > 0$. Then
$$\gamma_n^2 = \operatorname{trace}(DD') \ge \sup_{|t| \le 1} |D' t|^2 \ge \inf_{|t| = 1} |D' t|^2 \ge c_0^2 \gamma_n^2,$$
from which it follows that

(10) $c_0 |t| \le |D' t|/\gamma_n \le |t|$ for all $t \in \mathbb{R}^p$.

Similarly, $P|Du|^2 = \operatorname{trace}\big(D\, P(uu')\, D'\big) \le \sigma^2 \gamma_n^2$, implying that $|Du| = O_p(\gamma_n)$ and $\tilde\theta_n = \gamma_n^{-2} V_n^{-1} Du = O_p(\gamma_n^{-1})$. In particular, $P\{\tilde\theta_n \in \mathcal{N}\} \to 1$, because $\theta_0$ is an interior point of $\Theta$. Note also that $P\big|\sum_{i \le n} \xi_i u_i\big|^2 \le \sigma^2 \operatorname{trace}\big(\sum_{i \le n} \xi_i \xi_i'\big) = \sigma^2 \operatorname{trace}(V_n^{-1}) < \infty$. Consequently $\sum_{i \le n} \xi_i u_i = O_p(1)$.

From the assumed consistency, we know that there is a sequence of balls $\mathcal{N}_n \subseteq \mathcal{N}$ that shrink to $\{0\}$ for which $P\{\hat\theta_n \in \mathcal{N}_n\} \to 1$. From (vi) and (v), it follows that both

(11) $r_n := \sup\{d(\theta, 0) : \theta \in \mathcal{N}_n\}$ and $J(r_n) = \int_0^{r_n} \rho N(y, \mathcal{N}, d)\,dy$ converge to zero as $n \to \infty$.

The $n \times 1$ remainder vector $R(\theta) := f(\theta) - D'\theta$ has $i$th component
$$R_i(\theta) = f_i(\theta) - D_i(0)'\theta = \int_0^1 \big( D_i(t\theta) - D_i(0) \big)'\theta\, dt.$$
Uniformly in the neighborhood $\mathcal{N}_n$ we have
$$|R(\theta)| \le |\theta| \Big( \sum_{i \le n} M_{in}^2 \Big)^{1/2} r_n = o\big( |\theta|\, \gamma_n \big),$$
which, together with the upper bound from inequality (10), implies

(12) $|f(\theta)|^2 = |D'\theta|^2 + o(\gamma_n^2 |\theta|^2) = O(\gamma_n^2 |\theta|^2)$ uniformly for $\theta \in \mathcal{N}_n$.

In the neighborhood $\mathcal{N}_n$, via (11) we also have $|u \cdot R(\theta)| \le |\theta|\, \sup_{s \in \mathcal{N}_n} \big| \sum_i u_i \big( D_i(s) - D_i(0) \big) \big|$. From Corollary 1 with $g_i(\theta) = D_i(\theta) - D_i(0)$, deduce that
$$P \sup_{s \in \mathcal{N}_n} \Big| \sum_{i \le n} u_i \big( D_i(s) - D_i(0) \big) \Big|^2 \le c_2^2 \sigma^2 J(r_n)^2 \sum_{i \le n} M_{in}^2 = o(\gamma_n^2),$$
which implies

(13) $u \cdot R(\theta) = o_p(\gamma_n |\theta|)$ uniformly for $\theta \in \mathcal{N}_n$.

Approximations (12) and (13) give us uniform approximations for the criterion function in the shrinking neighborhoods $\mathcal{N}_n$:

(14) $G_n(\theta) = \|u - f(\theta)\|^2 - \|u\|^2 = -2\,u \cdot f(\theta) + |f(\theta)|^2$
$= -2\,u \cdot D'\theta + |D'\theta|^2 + o_p(\gamma_n |\theta|) + o_p(\gamma_n^2 |\theta|^2)$
$= \|u - D'\tilde\theta_n\|^2 - \|u\|^2 + |D'(\theta - \tilde\theta_n)|^2 + o_p(\gamma_n |\theta|) + o_p(\gamma_n^2 |\theta|^2).$

The uniform smallness of the remainder terms lets us approximate $G_n$ at random points that are known to lie in $\mathcal{N}_n$. The rest of the argument is similar to that of Chernoff (1954). When $\hat\theta_n \in \mathcal{N}_n$ we have $G_n(\hat\theta_n) \le G_n(0)$, implying
$$|D'(\hat\theta_n - \tilde\theta_n)|^2 + o_p(\gamma_n |\hat\theta_n|) + o_p(\gamma_n^2 |\hat\theta_n|^2) \le |D'\tilde\theta_n|^2.$$
Invoke (10) again, simplifying the last approximation to
$$c_0^2 \big( \gamma_n |\hat\theta_n| - \gamma_n |\tilde\theta_n| \big)^2 \le O_p(1) + o_p\big( \gamma_n |\hat\theta_n| + \gamma_n^2 |\hat\theta_n|^2 \big).$$
It follows that $|\hat\theta_n| = O_p(\gamma_n^{-1})$ and, via (12), $\kappa_n(\hat\theta_n) = |f(\hat\theta_n)| = O_p(1)$.

We may also assume that $\mathcal{N}_n$ shrinks slowly enough to ensure that $P\{\tilde\theta_n \in \mathcal{N}_n\} \to 1$. When both $\hat\theta_n$ and $\tilde\theta_n$ lie in $\mathcal{N}_n$, the inequality $G_n(\hat\theta_n) \le G_n(\tilde\theta_n)$ and approximation (14) give
$$|D'(\hat\theta_n - \tilde\theta_n)|^2 + o_p(1) \le o_p(1).$$
It follows that $\hat\theta_n = \tilde\theta_n + o_p(\gamma_n^{-1})$.

Remark. If the errors are iid and $\max_i |\xi_{i,n}| = o(1)$, then the distribution of $\sum_{i \le n} \xi_{i,n} u_{in}$ is asymptotically $N(0, \sigma^2 V_n^{-1})$.
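The linearization at the heart of the proof can be watched in action on a toy one-parameter model of our own choosing (not from the paper): compare the nonlinear LSE with the one-step solution $\tilde\theta_n = \theta_0 + (DD')^{-1}Du$, which for $p = 1$ is just a ratio of sums. All numerical choices below are illustrative.

```python
import math, random

random.seed(2)
n, theta0, sigma = 200, 1.0, 0.05
x = [random.random() for _ in range(n)]
u = [random.gauss(0, sigma) for _ in range(n)]

def f(theta, xi):
    return math.exp(-theta * xi)   # a smooth one-parameter regression function

y = [f(theta0, xi) + ui for xi, ui in zip(x, u)]

# linearized ("one-step") estimator: theta0 + (D'D)^{-1} D'u, with D_i = df_i/dtheta at theta0
D = [-xi * math.exp(-theta0 * xi) for xi in x]
theta_lin = theta0 + sum(di * ui for di, ui in zip(D, u)) / sum(di * di for di in D)

# nonlinear LSE by a fine grid search near theta0
def rss(theta):
    return sum((yi - f(theta, xi)) ** 2 for yi, xi in zip(y, x))

grid = [theta0 - 0.5 + k * 1e-3 for k in range(1001)]
theta_hat = min(grid, key=rss)
```

With this sample size the two estimators agree to within a few grid steps, as the theorem predicts.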

4. ANALYSIS OF MODEL (3): WU'S EXAMPLE

The results in this section illustrate the work of our limit theorems in a particular case where Wu's method fails. We prove both consistency and a central limit theorem for the model (3) with $\theta_0 = (\lambda_0, 1/2)$. In fact, without loss of generality, $\lambda_0 = 1$. As before, let $l_n = \log n$. Remember $\theta = (\lambda, \mu)$, with $\lambda \in \mathbb{R}$ and $0 \le \mu \le C_\mu$ for a finite constant $C_\mu$ greater than $1/2$, which ensures that $\theta_0 = (1, 1/2)$ is an interior point of the parameter space. Taking $C_\mu = 1/2$ would complicate the central limit theorem only slightly.

The behavior of $\hat\theta_n$ is determined by the behavior of the function
$$G_n(\gamma) := \sum_{i \le n} i^{-1+\gamma} \quad \text{for } \gamma \le 1,$$
or its standardized version
$$g_n(\beta) := G_n(\beta/l_n)/G_n(0) = \sum_{i \le n} \big( i^{-1}/G_n(0) \big) \exp(\beta l_i/l_n),$$
which is the moment generating function of the probability distribution that puts mass $i^{-1}/G_n(0)$ at $l_i/l_n$, for $i = 1, \ldots, n$. For large $n$, the function $g_n$ is well approximated by the increasing, nonnegative function
$$g(\beta) = \begin{cases} (e^\beta - 1)/\beta & \text{for } \beta \neq 0, \\ 1 & \text{for } \beta = 0, \end{cases}$$
the moment generating function of the uniform distribution on $(0, 1)$. More precisely, comparison of the sum with the integral $\int_1^n x^{-1+\gamma}\,dx$ gives

(15) $G_n(\gamma) = l_n g(\gamma l_n) + r_n(\gamma)$ with $0 \le r_n(\gamma) \le 1$, for $\gamma \le 1$.

The distributions corresponding to both $g_n$ and $g$ are concentrated on $[0, 1]$. Both functions have the properties described in the following lemma.

Lemma 1. Suppose $h(\gamma) = P\exp(\gamma x)$, the moment generating function of a probability distribution concentrated on $[0, 1]$. Then
(i) $\log h$ is convex;

(ii) $h(\gamma)^2/h(2\gamma)$ is unimodal: increasing for $\gamma < 0$, decreasing for $\gamma > 0$, achieving its maximum value of 1 at $\gamma = 0$;
(iii) $h'(\gamma) \le h(\gamma)$.

Proof. Assertion (i) is just the well known fact that the logarithm of a moment generating function is convex. Thus $h'/h$, the derivative of $\log h$, is an increasing function, which implies (ii) because
$$\frac{d}{d\gamma} \log\frac{h(\gamma)^2}{h(2\gamma)} = 2\,\frac{h'(\gamma)}{h(\gamma)} - 2\,\frac{h'(2\gamma)}{h(2\gamma)}.$$
Property (iii) comes from the representation $h'(\gamma) = P\big( x e^{\gamma x} \big)$, with $x \le 1$.

Remark. Direct calculation shows that $g(\gamma)^2/g(2\gamma)$ is a symmetric function of $\gamma$.

Reparametrize by putting $\beta = (1 - 2\mu) l_n$, with $(1 - 2C_\mu) l_n \le \beta \le l_n$, and $\alpha = \lambda \sqrt{G_n(\beta/l_n)}$. Notice that $|f(\theta)| = |\alpha|$ and that $\theta_0$ corresponds to $\alpha_0 = \sqrt{G_n(0)} \approx \sqrt{l_n}$ and $\beta_0 = 0$. Also
$$f_i(\theta) = \alpha\, \nu_i(\beta/l_n) \quad \text{where } \nu_i(\gamma) := i^{-1/2} \exp(\gamma l_i/2) \big/ \sqrt{G_n(\gamma)},$$
and

(16) $\kappa_n(\theta)^2 = G_n(0)\big( \lambda^2 g_n(\beta) - 2\lambda g_n(\beta/2) + 1 \big).$

We define $\bar\nu_i := \sup_{\gamma \le 1} \nu_i(\gamma)$.

Lemma 2. For all $(\alpha, \beta)$ corresponding to $\theta = (\lambda, \mu) \in \mathbb{R} \times [0, C_\mu]$:
(i) $\kappa_n(\theta) - \sqrt{G_n(0)} \le |\alpha| \le \kappa_n(\theta) + \sqrt{G_n(0)}$;
(ii) $\sum_{i \le n} \bar\nu_i^2 = O(\log\log n)$;
(iii) $|d\nu_i(\beta/l_n)/d\beta| \le \tfrac{1}{2}\nu_i(\beta/l_n)$;
(iv) $|f_i(\alpha_1, \beta_1) - f_i(\alpha_2, \beta_2)| \le \big( |\alpha_1 - \alpha_2| + \tfrac{1}{2}|\alpha_2|\,|\beta_1 - \beta_2| \big) \bar\nu_i$;
(v) $|f_i(\theta) - f_i(\theta_0)| \le i^{-1/2} + |\alpha|\,\bar\nu_i$.

Proof. Inequalities (i) and (v) follow from the triangle inequality.

For inequality (ii), first note that $\bar\nu_1^2 \le 1$. For $i \ge 2$, separate out contributions from three ranges:
$$\bar\nu_i^2 = \max\Big( \sup_{1/l_n \le \gamma \le 1} \nu_i(\gamma)^2,\; \sup_{|\gamma| < 1/l_n} \nu_i(\gamma)^2,\; \sup_{\gamma \le -1/l_n} \nu_i(\gamma)^2 \Big).$$
For $\gamma \ge 1/l_n$, invoke (15) to get a tractable upper bound:
$$\nu_i(\gamma)^2 \le \frac{i^{-1}\exp(\gamma l_i)}{l_n g(\gamma l_n)} \le \frac{i^{-1}\gamma \exp(\gamma l_i)}{\exp(\gamma l_n)(1 - e^{-1})} = \frac{i^{-1}\exp\big( \log\gamma + \gamma\log(i/n) \big)}{1 - e^{-1}}.$$
The last expression achieves its maximum over $[1/l_n, 1]$ at
$$\gamma_0 := \begin{cases} 1/\log(n/i) & \text{if } 1 \le i \le n/e, \\ 1 & \text{if } n/e \le i \le n, \end{cases}$$
which gives

(17) $\sup_{1/l_n \le \gamma \le 1} \nu_i(\gamma)^2 \le \dfrac{e^{-1}}{1 - e^{-1}}\, n^{-1} H(i/n)$ for $i \le n/e$, where $H(x) := 1\big/\big( x\log(1/x) \big)$.

Similarly, if $-1 < \gamma l_n < 1$,
$$\nu_i(\gamma)^2 \le \frac{\exp(\gamma l_i)}{i\, l_n g(\gamma l_n)} \le \frac{\exp(l_i/l_n)}{i\, l_n g(-1)} \le \frac{e/g(-1)}{i\, l_n}.$$
The last term is smaller than a constant multiple of the bound from (17). Finally, if $\gamma = -\delta \le -1/l_n$ and $i \ge 2$, then
$$\nu_i(\gamma)^2 \le \frac{i^{-1}\exp(-\delta l_i)\,\delta l_n}{l_n\big( 1 - \exp(-\delta l_n) \big)} \le \frac{i^{-1}\exp(\log\delta - \delta l_i)}{1 - e^{-1}} \le \frac{e^{-1}/(1 - e^{-1})}{i\, l_i}.$$
In summary, for some universal constant $C$,
$$\bar\nu_i^2 \le C \max\Big( n^{-1} H(i/n)\,\mathbf{1}\{i \le n/e\},\; \frac{1}{i\log i} \Big) \quad \text{for } 2 \le i \le n.$$
Bounding sums by integrals, we thus have
$$C^{-1} \sum_{i=2}^n \bar\nu_i^2 \le \int_{1/n}^{1/e} H(x)\,dx + \frac{H(1/e)}{n} + \int_2^n \frac{dx}{x\log x} = O(\log\log n).$$
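The integral comparison (15) invoked in the argument above is easy to check numerically. The following sketch (our illustration; the sample size and grid of $\gamma$ values are arbitrary) confirms that the remainder $r_n(\gamma) = G_n(\gamma) - l_n g(\gamma l_n)$ stays in $[0, 1]$:

```python
import math

def G(n, gamma):
    # G_n(gamma) = sum_{i<=n} i^{gamma - 1}
    return sum(i ** (gamma - 1.0) for i in range(1, n + 1))

def g(beta):
    # moment generating function of Uniform(0,1)
    return (math.exp(beta) - 1.0) / beta if beta != 0 else 1.0

n = 1000
ln = math.log(n)
for gamma in (-0.5, -0.1, 0.01, 0.3, 0.9):
    r = G(n, gamma) - ln * g(gamma * ln)
    assert 0.0 <= r <= 1.0   # the remainder bound in (15)
```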

For (iii), note that
$$2\,\frac{d}{d\beta}\nu_i(\beta/l_n) = 2\,\frac{d}{d\beta}\,\frac{i^{-1/2}\exp\big( \tfrac{1}{2}\beta l_i/l_n \big)}{\sqrt{G_n(0) g_n(\beta)}} = \Big( \frac{l_i}{l_n} - \frac{g_n'(\beta)}{g_n(\beta)} \Big)\,\nu_i(\beta/l_n),$$
which is bounded in absolute value by $\nu_i(\beta/l_n)$ because $0 \le g_n'(\beta) \le g_n(\beta)$ and $0 \le l_i/l_n \le 1$.

For (iv),
$$|f_i(\alpha_1, \beta_1) - f_i(\alpha_2, \beta_2)| \le |\alpha_1 - \alpha_2|\,\nu_i(\beta_1/l_n) + |\alpha_2|\,\big| \nu_i(\beta_1/l_n) - \nu_i(\beta_2/l_n) \big| \le |\alpha_1 - \alpha_2|\,\bar\nu_i + |\alpha_2|\,|\beta_1 - \beta_2|\,\tfrac{1}{2}\bar\nu_i,$$
the bound for the second term coming from the mean-value theorem and (iii).

Lemma 3. For $\epsilon > 0$, let $N_\epsilon = \{\theta : \max(|\lambda - 1|, |\beta|) \le \epsilon\}$. If $\epsilon$ is small enough, there exists a constant $C_\epsilon > 0$ such that $\inf\{\kappa_n(\theta) : \theta \notin N_\epsilon\} \ge C_\epsilon \sqrt{l_n}$ when $n$ is large enough.

Proof. Suppose $|\beta| \ge \epsilon$. Remember that $G_n(0) \approx l_n$. Minimize over $\lambda$ the lower bound (16) for $\kappa_n(\theta)^2$ by choosing $\lambda = g_n(\beta/2)/g_n(\beta)$, then invoke Lemma 1(ii):
$$\frac{\kappa_n(\theta)^2}{l_n} \gtrsim 1 - \frac{g_n(\beta/2)^2}{g_n(\beta)} \ge 1 - \max\Big( \frac{g_n(\epsilon/2)^2}{g_n(\epsilon)},\; \frac{g_n(-\epsilon/2)^2}{g_n(-\epsilon)} \Big) \to 1 - \frac{g(\epsilon/2)^2}{g(\epsilon)} > 0.$$
If $|\beta| \le \epsilon$ and $\epsilon$ is small enough to make $(1 - \epsilon)e^{\epsilon/2} < 1 < (1 + \epsilon)e^{-\epsilon/2}$, use
$$\kappa_n(\theta)^2 = \sum_{i \le n} i^{-1}\big( \lambda\exp(\beta l_i/2 l_n) - 1 \big)^2.$$
If $\lambda \ge 1 + \epsilon$, bound each summand from below by $i^{-1}\big( (1 + \epsilon)e^{-\epsilon/2} - 1 \big)^2$. If $\lambda \le 1 - \epsilon$, bound each summand from below by $i^{-1}\big( 1 - (1 - \epsilon)e^{\epsilon/2} \big)^2$.

4.1. Consistency. On the annulus $S_R := \{R \le \kappa_n(\theta) < 2R\}$ we have, with $K_R := 2R + \sqrt{G_n(0)}$,
$$|f_i(\theta_1) - f_i(\theta_2)| \le K_R \bar\nu_i\, d_R(\theta_1, \theta_2), \quad \text{where } d_R(\theta_1, \theta_2) := |\alpha_1 - \alpha_2|/K_R + \tfrac{1}{2}|\beta_1 - \beta_2|,$$
$$|f_i(\theta) - f_i(\theta_0)| \le b_i := i^{-1/2} + K_R \bar\nu_i.$$
Note that
$$\sum_{i \le n} \big( i^{-1/2} + K_R \bar\nu_i \big)^2 = O\big( l_n + K_R^2 \log l_n \big) = O(K_R^2 L_n), \quad \text{where } L_n := \log\log n.$$

The rectangle $\{|\alpha| \le K_R,\; |\beta| \le c\, l_n\}$ can be partitioned into $O(y^{-1}\, l_n/y)$ subrectangles of $d_R$-diameter at most $y$. Thus $N(y, S_R, d_R) \le C_0 l_n/y^2$ for a constant $C_0$ that depends only on $C_\mu$, which gives
$$\int_0^1 \rho N(y, S_R, d_R)\,dy = O\big( \sqrt{L_n} \big).$$
Apply Theorem 3 with $\delta = 1$ to conclude that
$$P\{\hat\theta_n \in S_R\} \le C_1 K_R^2 L_n^2 / R^4 \le C_2\big( R^2 + l_n \big) L_n^2 / R^4.$$
Put $R = R_k := C_3 2^k (l_n L_n^2)^{1/4}$, then sum over $k$ to deduce that $P\{\kappa_n(\hat\theta_n) \ge C_3 (l_n L_n^2)^{1/4}\} \le \epsilon$ eventually, if the constant $C_3$ is large enough. That is, $\kappa_n(\hat\theta_n) = O_p\big( (l_n L_n^2)^{1/4} \big)$ and, via Lemma 3, $\hat\lambda_n - 1 = o_p(1)$ and $2 l_n (\hat\mu_n - \mu_0) = -\hat\beta_n = o_p(1)$.

4.2. Central limit theorem. This time work with the $(\lambda, \beta)$ reparametrization, with
$$D_i(\lambda, \beta) = \Big( \frac{\partial f_i(\lambda, \beta)}{\partial\lambda},\; \frac{\partial f_i(\lambda, \beta)}{\partial\beta} \Big) = f_i(\lambda, \beta)\,\big( 1/\lambda,\; l_i/2 l_n \big), \quad \text{where } f_i(\lambda, \beta) = \lambda\, i^{-1/2 + \beta/2 l_n},$$
and $\theta_0 = (\lambda_0, \beta_0) = (1, 0)$. Take $d$ as the usual two-dimensional Euclidean distance in the $(\lambda, \beta)$ space. For simplicity of notation, we omit some $n$ subscripts, even though the relationship between $\theta$ and $(\lambda, \beta)$ changes with $n$. We have just shown that the LSE $(\hat\lambda_n, \hat\beta_n)$ is consistent.

Comparison of sums with analogous integrals gives the approximations

(18) $\sum_{i \le n} i^{-1} l_i^p = l_n^{p+1}/(p + 1) + r_p$ with $|r_p| \le 1$, for $p = 0, 1, 2, \ldots$

In consequence,
$$\gamma_n^2 = \sum_{i \le n} |D_i(\lambda_0, \beta_0)|^2 = \sum_{i \le n} i^{-1}\big( 1 + l_i^2/4 l_n^2 \big) = \tfrac{13}{12}\, l_n + O(1)$$
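The constant $\gamma_n^2 \approx \tfrac{13}{12} l_n$ and the limiting entries $(12/13, 3/13, 1/13)$ of $V_n$ can be double-checked numerically. The sketch below is our own sanity check, with an arbitrary choice of $n$:

```python
import math

n = 10000
ln = math.log(n)
li = [math.log(i) for i in range(1, n + 1)]

# gamma_n^2 = sum_i i^{-1} (1 + l_i^2/(4 l_n^2)), expected to be (13/12) l_n + O(1)
gamma2 = sum((1 + li[i - 1] ** 2 / (4 * ln * ln)) / i for i in range(1, n + 1))

# entries of V_n = gamma_n^{-2} sum_i i^{-1} (1, l_i/2l_n)'(1, l_i/2l_n),
# which should approach V = (1/13) [[12, 3], [3, 1]]
v11 = sum(1 / i for i in range(1, n + 1)) / gamma2
v12 = sum(li[i - 1] / (2 * ln * i) for i in range(1, n + 1)) / gamma2
v22 = sum(li[i - 1] ** 2 / (4 * ln * ln * i) for i in range(1, n + 1)) / gamma2
```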

and
$$V_n = \gamma_n^{-2} \sum_{i \le n} i^{-1} \begin{pmatrix} 1 \\ l_i/2 l_n \end{pmatrix} \big( 1,\; l_i/2 l_n \big) = V + O(1/l_n), \quad \text{where } V = \frac{1}{13}\begin{pmatrix} 12 & 3 \\ 3 & 1 \end{pmatrix}.$$
The smaller eigenvalue of $V_n$ converges to the smaller eigenvalue of the positive definite matrix $V$, which is strictly positive. Within the neighborhood $N_\epsilon := \{\max(|\lambda - 1|, |\beta|) \le \epsilon\}$, for a fixed $\epsilon \le 1/2$, both $|f_i(\lambda, \beta)|$ and $|D_i(\lambda, \beta)|$ are bounded by a multiple of $i^{-1/2}$. Thus
$$|D_i(\theta_1) - D_i(\theta_2)| \le |\lambda_1^{-1} - \lambda_2^{-1}|\,|f_i(\theta_1)| + 3\,|f_i(\theta_1) - f_i(\theta_2)| \le C_\epsilon\, i^{-1/2}\, d(\theta_1, \theta_2).$$
That is, we may take $M_i$ as a multiple of $i^{-1/2}$, which gives $\sum_{i \le n} M_i^2 = O(l_n)$. All the conditions of Theorem 4 are satisfied. We have
$$\sqrt{l_n}\,\big( \hat\lambda_n - 1,\; \hat\beta_n \big) = \tfrac{12}{13} \sum_{i \le n} u_i\, i^{-1/2}\, l_n^{-1/2}\, \big( 1,\; l_i/2 l_n \big)\, V^{-1} + o_p(1).$$

REFERENCES

Chernoff, H. (1954). On the distribution of the likelihood ratio. Annals of Mathematical Statistics 25, 573-578.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics 40, 633-643.
Lai, T. L. (1994). Asymptotic properties of nonlinear least-squares estimates in stochastic regression models. Annals of Statistics 22, 1917-1930.
Pollard, D. (1989). Asymptotics via empirical processes (with discussion). Statistical Science 4, 341-366.
Skouras, K. (2000). Strong consistency in nonlinear stochastic regression models. Annals of Statistics 28, 871-879.
Van de Geer, S. (1990). Estimating a regression function. Annals of Statistics 18, 907-924.

Van de Geer, S. and M. Wegkamp (1996). Consistency for the least squares estimator in nonparametric regression. Annals of Statistics 24, 2513-2523.
van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag.
Wu, C.-F. (1981). Asymptotic theory of nonlinear least squares estimation. Annals of Statistics 9, 501-513.

STATISTICS DEPARTMENT, YALE UNIVERSITY, BOX 208290 YALE STATION, NEW HAVEN, CT 06520-8290.
E-mail address: david.pollard@yale.edu; peter.radchenko@yale.edu
URL: http://www.stat.yale.edu/~pollard/; http://pantheon.yale.edu/~pvr4