
NONLINEAR LEAST-SQUARES ESTIMATION

DAVID POLLARD AND PETER RADCHENKO

ABSTRACT. The paper uses empirical process techniques to study the asymptotics of the least-squares estimator for the fitting of a nonlinear regression function. By combining and extending ideas of Wu and Van de Geer, it establishes new consistency and central limit theorems that hold under only second moment assumptions on the errors. An application to a delicate example of Wu's illustrates the use of the new theorems, leading to a normal approximation to the least-squares estimator with unusual logarithmic rescalings.

1. INTRODUCTION

Consider the model where we observe $y_i$ for $i = 1, \ldots, n$ with

(1) $y_i = f_i(\theta) + u_i$ where $\theta \in \Theta$.

The unobserved $f_i$ can be random or deterministic functions. The unobserved errors $u_i$ are random with zero means and finite variances. The index set $\Theta$ might be infinite dimensional. Later in the paper it will prove convenient to also consider triangular arrays of observations.

Think of $f(\theta) = (f_1(\theta), \ldots, f_n(\theta))$ and $u = (u_1, \ldots, u_n)$ as points in $\mathbb{R}^n$. The model specifies a surface $\mathcal{M}_\Theta = \{f(\theta) : \theta \in \Theta\}$ in $\mathbb{R}^n$. The vector of observations

Date: Original version April 1995; revised March 2002; revised 2 December 2003.
2000 Mathematics Subject Classification. Primary 62E20. Secondary: 60F05, 62G08, 62G20.
Key words and phrases. Nonlinear least squares, empirical processes, subgaussian, consistency, central limit theorem.

$y = (y_1, \ldots, y_n)$ is a random point in $\mathbb{R}^n$. The least-squares estimator (LSE) $\hat\theta_n$ is defined to minimize the distance from $y$ to $\mathcal{M}_\Theta$,
$$\hat\theta_n = \operatorname{argmin}_{\theta \in \Theta} \|y - f(\theta)\|^2,$$
where $\|\cdot\|$ denotes the usual Euclidean norm on $\mathbb{R}^n$.

Many authors have considered the behavior of $\hat\theta_n$ as $n \to \infty$ when the $y_i$ are generated by the model for a fixed $\theta_0$ in $\Theta$. When the $f_i$ are deterministic, it is natural to express assertions about convergence of $\hat\theta_n$ in terms of the $n$-dimensional Euclidean distance
$$\kappa_n(\theta_1, \theta_2) := \|f(\theta_1) - f(\theta_2)\|.$$
For example, Jennrich (1969) took $\Theta$ to be a compact subset of $\mathbb{R}^p$, the errors $\{u_i\}$ to be iid with zero mean and finite variance, and the $f_i$ to be continuous functions of $\theta$. He proved strong consistency of the least-squares estimator under the assumption that $n^{-1}\kappa_n(\theta_1, \theta_2)^2$ converges uniformly to a continuous function that is zero if and only if $\theta_1 = \theta_2$. He also gave conditions for asymptotic normality.

Under similar assumptions, Wu (1981, Theorem 1) proved that existence of a consistent estimator for $\theta_0$ implies

(2) $\kappa_n(\theta) := \kappa_n(\theta, \theta_0) \to \infty$ at each $\theta \neq \theta_0$.

If $\Theta$ is finite, the divergence (2) is also a sufficient condition for the existence of a consistent estimator (Wu 1981, Theorem 2). His main consistency result (his Theorem 3) may be reexpressed as a general convergence assertion.

Theorem 1. Suppose the $\{f_i\}$ are deterministic functions indexed by a subset $\Theta$ of $\mathbb{R}^p$. Suppose also that $\sup_i \operatorname{var}(u_i) < \infty$ and $\kappa_n(\theta) \to \infty$ at each $\theta \neq \theta_0$. Let $S$ be a bounded subset of $\Theta \setminus \{\theta_0\}$ and let $R_n := \inf_{\theta \in S} \kappa_n(\theta)$. Suppose there exist constants $\{L_i\}$ such that
(i) $\sup_{\theta \in S} |f_i(\theta) - f_i(\theta_0)| \le L_i$ for each $i$;
(ii) $|f_i(\theta_1) - f_i(\theta_2)| \le L_i \|\theta_1 - \theta_2\|$ for all $\theta_1, \theta_2 \in S$;
(iii) $\sum_{i \le n} L_i^2 = O(R_n^\alpha)$ for some $\alpha < 4$.
Then $P\{\hat\theta_n \notin S \text{ eventually}\} = 1$.
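As a concrete illustration of the definitions above (our addition, not part of the original argument), the following Python sketch computes the LSE for the model $f_i(\theta) = \lambda i^{-\mu}$ studied later in the paper, by brute-force minimization of $\|y - f(\theta)\|^2$ over a finite grid. The sample size, noise level, and grid are arbitrary illustrative choices.

```python
import random

def f(i, lam, mu):
    # regression function of model (3): f_i(theta) = lambda * i**(-mu)
    return lam * i ** (-mu)

def rss(y, lam, mu):
    # residual sum of squares ||y - f(theta)||^2
    return sum((y[i - 1] - f(i, lam, mu)) ** 2 for i in range(1, len(y) + 1))

def lse_grid(y, lams, mus):
    # brute-force LSE over a finite grid of (lambda, mu) values
    return min(((lam, mu) for lam in lams for mu in mus),
               key=lambda th: rss(y, *th))

random.seed(0)
n, theta0 = 200, (1.0, 0.25)
y = [f(i, *theta0) + random.gauss(0, 0.05) for i in range(1, n + 1)]
lams = [k / 20 for k in range(10, 31)]   # grid 0.50, 0.55, ..., 1.50
mus = [k / 20 for k in range(0, 11)]     # grid 0.00, 0.05, ..., 0.50
theta_hat = lse_grid(y, lams, mus)
```

Since the true parameter lies on the grid, the fitted value can do no worse than $\theta_0$ in residual sum of squares.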

Remark. Assumption (i) implies $\sum_{i \le n} L_i^2 \ge \kappa_n(\theta)^2$ for each $\theta$ in $S$, which forces $R_n \to \infty$. If $\Theta$ is compact and if for each $\theta \neq \theta_0$ there is a neighborhood $S = S_\theta$ satisfying the conditions of the Theorem, then $\hat\theta_n \to \theta_0$ almost surely.

Wu's paper was the starting point for several authors. For example, both Lai (1994) and Skouras (2000) generalized Wu's consistency results by taking the functions $f_i(\theta) = f_i(\theta, \omega)$ as random processes indexed by $\theta$. They took the $\{u_i\}$ as a martingale difference sequence, with $\{f_i\}$ a predictable sequence of functions with respect to a filtration $\{\mathcal{F}_i\}$.

Another line of development is typified by the work of Van de Geer (1990) and Van de Geer and Wegkamp (1996). They took $f_i(\theta) = f(x_i, \theta)$, where $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ is a set of deterministic functions and the $(x_i, u_i)$ are iid pairs. In fact they identified $\Theta$ with the index set $\mathcal{F}$. Under a stronger assumption about the errors, they established rates of convergence of $\kappa_n(\hat\theta_n)$ in terms of $L^2$ entropy conditions on $\mathcal{F}$, using empirical process methods that were developed after Wu's work. The stronger assumption was that the errors are uniformly subgaussian. In general, we say that a random variable $W$ has a subgaussian distribution if there exists some finite $\tau$ such that $P\exp(tW) \le \exp(\tfrac{1}{2}\tau^2 t^2)$ for all $t \in \mathbb{R}$. We write $\tau(W)$ for the smallest such $\tau$. Van de Geer assumed that $\sup_i \tau(u_i) < \infty$.

Remark. Notice that we must have $PW = 0$ when $W$ is subgaussian, because the linear term in the expansion of $P\exp(tW)$ must vanish. When $PW = 0$, subgaussianity is equivalent to existence of a finite constant $\beta$ for which $P\{|W| \ge x\} \le 2\exp(-x^2/\beta^2)$ for all $x \ge 0$.

In our paper we try to bring together the two lines of development. Our main motivation for working on nonlinear least squares was an example presented by Wu (1981, page 507). He noted that his consistency theorem has difficulties with a simple model,

(3) $f_i(\theta) = \lambda i^{-\mu}$ for $\theta = (\lambda, \mu) \in \Theta$, a compact subset of $\mathbb{R} \times \mathbb{R}^+$.
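As a small numerical aside (our illustration, not from the paper): the simplest subgaussian variable is a Rademacher sign $\epsilon$, for which $P\exp(t\epsilon) = \cosh t \le \exp(t^2/2)$, so $\tau(\epsilon) \le 1$ in the notation above. A quick check of the moment generating function bound on a grid of $t$ values:

```python
import math

def mgf_rademacher(t):
    # P exp(t*eps) for eps = +/-1 with probability 1/2 each; equals cosh(t)
    return 0.5 * (math.exp(t) + math.exp(-t))

# verify the subgaussian bound P exp(tW) <= exp(tau^2 t^2 / 2) with tau = 1
for t in [k / 10 for k in range(-50, 51)]:
    assert mgf_rademacher(t) <= math.exp(0.5 * t * t) + 1e-12
```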

For example, condition (2) does not hold for $\theta_0 = (0, 0)$ at any $\theta$ with $\mu > 1/2$. When $\theta_0 = (\lambda_0, 1/2)$, Wu's method fails in a more subtle way, but Van de Geer's (1990) method would work if the errors satisfied the subgaussian assumption. In Section 4, under only second moment assumptions on the errors, we establish weak consistency and a central limit theorem.

The main idea behind all the proofs (ours, as well as those of Wu and Van de Geer) is quite simple. The LSE also minimizes the random function

(4) $G_n(\theta) := \|y - f(\theta)\|^2 - \|u\|^2 = \kappa_n(\theta)^2 - 2Z_n(\theta)$ where $Z_n(\theta) := u \cdot f(\theta) - u \cdot f(\theta_0)$.

In particular, $G_n(\hat\theta_n) \le G_n(\theta_0) = 0$, that is, $\tfrac{1}{2}\kappa_n(\hat\theta_n)^2 \le Z_n(\hat\theta_n)$. For every subset $S$ of $\Theta$,

(5) $P\{\hat\theta_n \in S\} \le P\{\exists\,\theta \in S : Z_n(\theta) \ge \tfrac{1}{2}\kappa_n(\theta)^2\} \le 4\,P\big(\sup_{\theta \in S} Z_n(\theta)\big)^2 \big/ \inf_{\theta \in S} \kappa_n(\theta)^4.$

The final bound calls for a maximal inequality for $Z_n$. Our methods for controlling $Z_n$ are similar in spirit to those of Van de Geer. Under her subgaussian assumption, for every class of real functions $\{g_i(\theta) : \theta \in \Theta\}$, the process

(6) $X(\theta) = \sum_{i \le n} u_i g_i(\theta)$

has subgaussian increments. Indeed, if $\tau(u_i) \le \tau$ for all $i$ then
$$\tau\big(X(\theta_1) - X(\theta_2)\big)^2 \le \sum_{i \le n} \tau(u_i)^2 \big(g_i(\theta_1) - g_i(\theta_2)\big)^2 \le \tau^2 \|g(\theta_1) - g(\theta_2)\|^2.$$
That is, the tails of $X(\theta_1) - X(\theta_2)$ are controlled by the $n$-dimensional Euclidean distance between the vectors $g(\theta_1)$ and $g(\theta_2)$. This property allowed her to invoke a chaining bound (similar to our Theorem 2) for the tail probabilities of $\sup_{\theta \in S} Z_n(\theta)$ for various annuli $S = \{\theta : R \le \kappa_n(\theta) < 2R\}$.

Under the weaker second moment assumption on the errors, we apply symmetrization arguments to transform to a problem involving a new process $Z_n^\circ(\theta)$ with conditionally subgaussian increments. We avoid Van de Geer's subgaussianity assumption at the cost of extra Lipschitz conditions on the $f_i(\theta)$, analogous to Assumption (ii) of Theorem 1,

which lets us invoke chaining bounds for conditional second moments of $\sup_{\theta \in S} Z_n^\circ(\theta)$ for various $S$.

In Section 3 we prove a new consistency theorem (Theorem 3) and a new central limit theorem (Theorem 4), generalizing Wu's Theorem 5 for nonlinear LSEs. More precisely, our consistency theorem corresponds to an explicit bound for $P\{\kappa_n(\hat\theta_n) \ge R\}$, but we state the result in a form that makes comparison with Theorem 1 easier. Our Theorem does not imply almost sure convergence, but our techniques could easily be adapted to that task. We regard the consistency as a preliminary to the next level of asymptotics and not as an end in itself. We describe the local asymptotic behavior with another approximation result, Theorem 4, which can easily be transformed into a central limit theorem under a variety of mild assumptions on the errors $\{u_i\}$. For example, in Section 4 we apply the Theorem to the model (3), to sharpen the consistency result at $\theta_0 = (1, 1/2)$ into the approximation

(7) $\big( l_n^{1/2}(\hat\lambda_n - 1),\; l_n^{3/2}(1 - 2\hat\mu_n) \big) = \sum_{i \le n} u_i \zeta_{i,n} + o_p(1)$

where $l_n := \log n$, $l_i := \log i$, and $\zeta_{i,n} = i^{-1/2} l_n^{-1/2}\big( 4 - 6\,l_i/l_n,\; -12 + 24\,l_i/l_n \big)$. The sum on the right-hand side of (7) is of order $O_p(1)$ when $\sup_i \operatorname{var}(u_i) < \infty$. If the $\{u_i\}$ are also identically distributed, the sum has a limiting multivariate normal distribution.

2. MAXIMAL INEQUALITIES

Assumption (ii) of Theorem 1 ensures that the increments $Z_n(\theta_1) - Z_n(\theta_2)$ are controlled by the ordinary Euclidean distance in $\Theta$; we allow for control by more general metrics. Wu invoked a maximal inequality for sums of random continuous processes, a result derived from a bound on the covering numbers for $\mathcal{M}_\Theta$ as a subset of $\mathbb{R}^n$ under the usual Euclidean distance; we work with covering numbers for other metrics.
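Before turning to the maximal inequalities, the algebraic identity in display (4) is easy to confirm numerically. The following sketch is our own illustration, with arbitrary choices of $f(\theta_0)$, $f(\theta)$, and errors; the identity $G_n(\theta) = \kappa_n(\theta)^2 - 2Z_n(\theta)$ holds exactly, up to floating-point rounding.

```python
import random

random.seed(1)
n = 50
f0 = [(i + 1) ** -0.5 for i in range(n)]        # f(theta_0), an arbitrary choice
f1 = [1.2 * (i + 1) ** -0.4 for i in range(n)]  # f(theta) at another parameter point
u = [random.gauss(0, 1) for _ in range(n)]      # errors
y = [a + e for a, e in zip(f0, u)]              # observations from model (1)

G = sum((yi - b) ** 2 for yi, b in zip(y, f1)) - sum(e * e for e in u)
kappa2 = sum((a - b) ** 2 for a, b in zip(f0, f1))   # kappa_n(theta)^2
Z = sum(e * (b - a) for e, a, b in zip(u, f0, f1))   # Z_n(theta) = u.(f(theta) - f(theta_0))
assert abs(G - (kappa2 - 2 * Z)) < 1e-9
```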

Definition 1. Let $(T, d)$ be a metric space. The covering number $N(\delta, S, d)$ is defined as the size of the smallest $\delta$-net for $S$; that is, the smallest $N$ for which there are points $t_1, \ldots, t_N$ in $T$ with $\min_i d(s, t_i) \le \delta$ for every $s$ in $S$.

Remark. The definition is the same for a pseudometric space, that is, a space where $d(\theta_1, \theta_2) = 0$ need not imply $\theta_1 = \theta_2$. In fact, all results in our paper that refer to metric spaces also apply to pseudometric spaces. The slight increase in generality is sometimes convenient when dealing with metrics defined by $L^p$ norms on functions.

Standard chaining arguments (see, for example, Pollard 1989) give maximal inequalities for processes with subgaussian increments controlled by a metric on the index set.

Theorem 2. Let $\{W_t : t \in T\}$ be a stochastic process, indexed by a metric space $(T, d)$, with subgaussian increments. Let $T_\delta$ be a $\delta$-net for $T$. Suppose:
(i) there is a constant $K$ such that $\tau(W_s - W_t) \le K d(s, t)$ for all $s, t \in T$;
(ii) $J(\delta) := \int_0^\delta \rho N(y, T, d)\,dy < \infty$, where $\rho N := \sqrt{1 + \log N}$.
Then there is a universal constant $c_1$ such that
$$P \sup_{t \in T} |W_t|^2 \le c_1^2 \Big( K J(\delta) + \rho N(\delta, T, d)\, \max_{s \in T_\delta} \tau(W_s) \Big)^2.$$

Remark. We should perhaps work with outer expectations because, in general, there is no guarantee that a supremum of uncountably many random variables is measurable. For concrete examples, such as the one discussed in Section 4, measurability can usually be established by routine separability arguments. Accordingly, we will ignore the issue in this paper.

Under the assumption that $\operatorname{var}(u_i) \le \sigma^2$, the $X$ process from (6) need not have subgaussian increments. However, it can be bounded in a stochastic sense by a symmetrized process $X^\circ(\theta) := \sum_{i \le n} \epsilon_i u_i g_i(\theta)$, where the $2n$ random variables $\epsilon_1, \ldots, \epsilon_n, u_1, \ldots, u_n$ are mutually independent with $P\{\epsilon_i = +1\} = 1/2 = P\{\epsilon_i = -1\}$. In fact, for each subset $S$ of the index set $\Theta$,

(8) $P \sup_{\theta \in S} |X(\theta)|^2 \le 4\, P \sup_{\theta \in S} |X^\circ(\theta)|^2.$
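To make Definition 1 concrete, here is a small sketch (our addition) that builds a $\delta$-net greedily for a discretized interval under the absolute-value metric. A greedy net is not minimal in general, so its size is only an upper bound for $N(\delta, S, d)$, of the expected order $O(1/\delta)$ for an interval.

```python
def covering_number_upper(points, delta, d):
    # greedy delta-net: scan the points, opening a new center whenever a point
    # is not within delta of an existing center
    centers = []
    for s in points:
        if all(d(s, t) > delta for t in centers):
            centers.append(s)
    return len(centers)  # size of *a* delta-net, hence an upper bound on N(delta, S, d)

S = [k / 100 for k in range(101)]   # the interval [0, 1], discretized
d = lambda a, b: abs(a - b)
N = covering_number_upper(S, 0.1, d)
```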

For a proof see, for example, van der Vaart and Wellner (1996, Lemma 2.3.1). The process $X^\circ$ has conditionally subgaussian increments, with

(9) $\tau_u\big(X^\circ(\theta_1) - X^\circ(\theta_2)\big)^2 \le \sum_{i \le n} u_i^2 \big(g_i(\theta_1) - g_i(\theta_2)\big)^2.$

The subscript $u$ indicates the conditioning on $u$.

Corollary 1. Let $S_\delta$ be a $\delta$-net for $S$ and let $X$ be as in (6). Suppose
(i) $Pu_i = 0$ and $\operatorname{var}(u_i) \le \sigma^2$ for $i = 1, \ldots, n$;
(ii) there is a metric $d$ for which $J(\delta) := \int_0^\delta \rho N(y, S, d)\,dy < \infty$;
(iii) there are constants $L_1, \ldots, L_n$ for which $|g_i(\theta_1) - g_i(\theta_2)| \le L_i d(\theta_1, \theta_2)$ for all $i$ and all $\theta_1, \theta_2 \in S$;
(iv) there are constants $b_1, \ldots, b_n$ for which $|g_i(\theta)| \le b_i$ for all $i$ and all $\theta$ in $S$.
Then there is a universal constant $c_2$ such that
$$P \sup_{\theta \in S} |X^\circ(\theta)|^2 \le c_2^2 \sigma^2 \big( L J(\delta) + B \rho N(\delta, S, d) \big)^2,$$
where $L^2 := \sum_i L_i^2$ and $B^2 := \sum_i b_i^2$.

Proof. From (9), $\tau_u\big(X^\circ(\theta_1) - X^\circ(\theta_2)\big) \le L_u d(\theta_1, \theta_2)$, where $L_u^2 := \sum_{i \le n} L_i^2 u_i^2$, and $\tau_u\big(X^\circ(\theta)\big) \le B_u$, where $B_u^2 := \sum_{i \le n} b_i^2 u_i^2$. Apply Theorem 2 conditionally to the process $X^\circ$ to bound $P_u \sup_{\theta \in S} |X^\circ(\theta)|^2$. Then invoke inequality (8), using the facts that $P L_u^2 \le \sigma^2 L^2$ and $P B_u^2 \le \sigma^2 B^2$.
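The conditional bound (9) is Hoeffding's observation that a Rademacher sum $\sum_i \epsilon_i a_i$ is subgaussian with squared parameter $\sum_i a_i^2$, since its moment generating function $\prod_i \cosh(t a_i)$ is dominated by $\exp(t^2 \sum_i a_i^2 / 2)$. A numerical check (our illustration; the vector $a$ is an arbitrary stand-in for the values $u_i(g_i(\theta_1) - g_i(\theta_2))$):

```python
import math

def mgf_rademacher_sum(a, t):
    # P exp(t * sum_i eps_i a_i) = prod_i cosh(t * a_i) for independent signs eps_i
    return math.prod(math.cosh(t * ai) for ai in a)

a = [0.5, -1.2, 2.0, 0.1]         # stands in for u_i * (g_i(theta1) - g_i(theta2))
tau2 = sum(ai * ai for ai in a)   # squared subgaussian parameter from (9)
for t in [k / 7 for k in range(-21, 22)]:
    assert mgf_rademacher_sum(a, t) <= math.exp(0.5 * tau2 * t * t) + 1e-12
```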

3. LIMIT THEOREMS

Inequality (5) and Corollary 1, with $g_i(\theta) = f_i(\theta) - f_i(\theta_0)$, give us some probabilistic control over $\hat\theta_n$.

Theorem 3. Let $S$ be a subset of $\Theta$ equipped with a pseudometric $d$. Let $\{L_i : i = 1, \ldots, n\}$, $\{b_i : i = 1, \ldots, n\}$, and $\delta$ be positive constants such that
(i) $|f_i(\theta_1) - f_i(\theta_2)| \le L_i d(\theta_1, \theta_2)$ for all $\theta_1, \theta_2 \in S$;
(ii) $|f_i(\theta) - f_i(\theta_0)| \le b_i$ for all $\theta \in S$;
(iii) $J(\delta) := \int_0^\delta \rho N(y, S, d)\,dy < \infty$.
Then
$$P\{\hat\theta_n \in S\} \le 4 c_2^2 \sigma^2 \big( B \rho N(\delta, S, d) + L J(\delta) \big)^2 \big/ R^4,$$
where $R := \inf\{\kappa_n(\theta) : \theta \in S\}$, $L^2 := \sum_i L_i^2$, and $B^2 := \sum_i b_i^2$.

The Theorem becomes more versatile in its application if we partition $S$ into a countable union of subsets $S_k$, each equipped with its own pseudometric and Lipschitz constants. We then have $P\{\hat\theta_n \in \cup_k S_k\}$ smaller than a sum over $k$ of bounds analogous to those in the Theorem. As shown in Section 4, this method works well for the Wu example if we take $S_k = \{\theta : R_k \le \kappa_n(\theta) < R_{k+1}\}$, for an $\{R_k\}$ sequence increasing geometrically.

A similar appeal to Corollary 1, with the $g_i(\theta)$ as partial derivatives of the $f_i(\theta)$ functions, gives us enough local control over $Z_n$ to go beyond consistency. To accommodate the application in Section 4, we change notation slightly by working with a triangular array: for each $n$,
$$y_{in} = f_{in}(\theta_0) + u_{in} \quad \text{for } i = 1, 2, \ldots, n,$$
where the $\{u_{in} : i = 1, \ldots, n\}$ are unobserved independent random variables with mean zero and variance bounded by $\sigma^2$.

Theorem 4. Suppose $\hat\theta_n \to \theta_0$ in probability, with $\theta_0$ an interior point of $\Theta$, a subset of $\mathbb{R}^p$. Suppose also:

(i) Each $f_{in}$ is continuously differentiable in a neighborhood $\mathcal{N}$ of $\theta_0$, with derivatives $D_{in}(\theta) = \partial f_{in}(\theta)/\partial\theta$.
(ii) $\gamma_n^2 := \sum_{i \le n} |D_{in}(\theta_0)|^2 \to \infty$ as $n \to \infty$.
(iii) There are constants $\{M_{in}\}$ with $\sum_{i \le n} M_{in}^2 = O(\gamma_n^2)$ and a metric $d$ on $\mathcal{N}$ for which $|D_{in}(\theta_1) - D_{in}(\theta_2)| \le M_{in} d(\theta_1, \theta_2)$ for $\theta_1, \theta_2 \in \mathcal{N}$.
(iv) The smallest eigenvalue of the matrix $V_n := \gamma_n^{-2} \sum_{i \le n} D_{in}(\theta_0) D_{in}(\theta_0)'$ is bounded away from zero for $n$ large enough.
(v) $\int_0^1 \rho N(y, \mathcal{N}, d)\,dy < \infty$.
(vi) $d(\theta, \theta_0) \to 0$ as $\theta \to \theta_0$.
Then $\kappa_n(\hat\theta_n) = O_p(1)$ and
$$\gamma_n(\hat\theta_n - \theta_0) = \sum_{i \le n} \xi_{i,n} u_{in} + o_p(1) = O_p(1),$$
where $\xi_{i,n} = \gamma_n^{-1} V_n^{-1} D_{in}(\theta_0)$.

Proof. Let $D$ be the $p \times n$ matrix with $i$th column $D_{in}(\theta_0)$, so that $\gamma_n^2 = \operatorname{trace}(DD')$ and $V_n = \gamma_n^{-2} DD'$. The main idea of the proof is to replace $f(\theta)$ by $f(\theta_0) + D'(\theta - \theta_0)$, thereby approximating $\hat\theta_n$ by the least-squares solution
$$\tilde\theta_n := \theta_0 + (DD')^{-1} D u = \operatorname{argmin}_{\theta \in \mathbb{R}^p} \| y - f(\theta_0) - D'(\theta - \theta_0) \|.$$
To simplify notation, assume, with no loss of generality, that $f(\theta_0) = 0$ and $\theta_0 = 0$. Also, drop extra $n$ subscripts when the meaning is clear. The assertion of the Theorem is that $\hat\theta_n = \tilde\theta_n + o_p(\gamma_n^{-1})$.

Without loss of generality, suppose the smallest eigenvalue of $V_n$ is larger than a fixed constant $c_0^2 > 0$. Then
$$\gamma_n^2 = \operatorname{trace}(DD') \ge \sup_{|t| \le 1} |D' t|^2 \ge \inf_{|t| = 1} |D' t|^2 \ge c_0^2 \gamma_n^2,$$
from which it follows that

(10) $c_0 |t| \le |D' t|/\gamma_n \le |t|$ for all $t \in \mathbb{R}^p$.

Similarly, $P|Du|^2 = \operatorname{trace}\big(D\, P(uu')\, D'\big) \le \sigma^2 \gamma_n^2$, implying that $|Du| = O_p(\gamma_n)$ and $\tilde\theta_n = \gamma_n^{-2} V_n^{-1} Du = O_p(\gamma_n^{-1})$. In particular, $P\{\tilde\theta_n \in \mathcal{N}\} \to 1$, because $\theta_0$ is an interior point of $\Theta$. Note also that $P\big|\sum_{i \le n} \xi_i u_i\big|^2 \le \sigma^2 \operatorname{trace}\big(\sum_{i \le n} \xi_i \xi_i'\big) = \sigma^2 \operatorname{trace}(V_n^{-1}) < \infty$. Consequently $\sum_{i \le n} \xi_i u_i = O_p(1)$.

From the assumed consistency, we know that there is a sequence of balls $\mathcal{N}_n \subseteq \mathcal{N}$ that shrink to $\{0\}$ for which $P\{\hat\theta_n \in \mathcal{N}_n\} \to 1$. From (vi) and (v), it follows that both

(11) $r_n := \sup\{d(\theta, 0) : \theta \in \mathcal{N}_n\}$ and $J(r_n) = \int_0^{r_n} \rho N(y, \mathcal{N}, d)\,dy$ converge to zero as $n \to \infty$.

The $n \times 1$ remainder vector $R(\theta) := f(\theta) - D'\theta$ has $i$th component
$$R_i(\theta) = f_i(\theta) - D_i(0)'\theta = \int_0^1 \big( D_i(t\theta) - D_i(0) \big)'\theta\, dt.$$
Uniformly in the neighborhood $\mathcal{N}_n$ we have
$$|R(\theta)| \le |\theta| \Big( \sum_{i \le n} M_{in}^2 \Big)^{1/2} r_n = o\big( |\theta|\, \gamma_n \big),$$
which, together with the upper bound from inequality (10), implies

(12) $|f(\theta)|^2 = |D'\theta|^2 + o(\gamma_n^2 |\theta|^2) = O(\gamma_n^2 |\theta|^2)$ uniformly for $\theta \in \mathcal{N}_n$.

In the neighborhood $\mathcal{N}_n$, via (11) we also have $|u \cdot R(\theta)| \le |\theta|\, \sup_{s \in \mathcal{N}_n} \big| \sum_i u_i \big( D_i(s) - D_i(0) \big) \big|$. From Corollary 1 with $g_i(\theta) = D_i(\theta) - D_i(0)$, deduce that
$$P \sup_{s \in \mathcal{N}_n} \Big| \sum_{i \le n} u_i \big( D_i(s) - D_i(0) \big) \Big|^2 \le c_2^2 \sigma^2 J(r_n)^2 \sum_{i \le n} M_{in}^2 = o(\gamma_n^2),$$
which implies

(13) $u \cdot R(\theta) = o_p(\gamma_n |\theta|)$ uniformly for $\theta \in \mathcal{N}_n$.

Approximations (12) and (13) give us uniform approximations for the criterion function in the shrinking neighborhoods $\mathcal{N}_n$:

(14) $G_n(\theta) = \|u - f(\theta)\|^2 - \|u\|^2 = -2\,u \cdot f(\theta) + |f(\theta)|^2$
$= -2\,u \cdot D'\theta + |D'\theta|^2 + o_p(\gamma_n |\theta|) + o_p(\gamma_n^2 |\theta|^2)$
$= \|u - D'\tilde\theta_n\|^2 - \|u\|^2 + |D'(\theta - \tilde\theta_n)|^2 + o_p(\gamma_n |\theta|) + o_p(\gamma_n^2 |\theta|^2).$

The uniform smallness of the remainder terms lets us approximate $G_n$ at random points that are known to lie in $\mathcal{N}_n$. The rest of the argument is similar to that of Chernoff (1954). When $\hat\theta_n \in \mathcal{N}_n$ we have $G_n(\hat\theta_n) \le G_n(0)$, implying
$$|D'(\hat\theta_n - \tilde\theta_n)|^2 + o_p(\gamma_n |\hat\theta_n|) + o_p(\gamma_n^2 |\hat\theta_n|^2) \le |D'\tilde\theta_n|^2.$$
Invoke (10) again, simplifying the last approximation to
$$c_0^2 \big( \gamma_n |\hat\theta_n| - \gamma_n |\tilde\theta_n| \big)^2 \le O_p(1) + o_p\big( \gamma_n |\hat\theta_n| + \gamma_n^2 |\hat\theta_n|^2 \big).$$
It follows that $|\hat\theta_n| = O_p(\gamma_n^{-1})$ and, via (12), $\kappa_n(\hat\theta_n) = |f(\hat\theta_n)| = O_p(1)$.

We may also assume that $\mathcal{N}_n$ shrinks slowly enough to ensure that $P\{\tilde\theta_n \in \mathcal{N}_n\} \to 1$. When both $\hat\theta_n$ and $\tilde\theta_n$ lie in $\mathcal{N}_n$, the inequality $G_n(\hat\theta_n) \le G_n(\tilde\theta_n)$ and approximation (14) give
$$|D'(\hat\theta_n - \tilde\theta_n)|^2 + o_p(1) \le o_p(1).$$
It follows that $\hat\theta_n = \tilde\theta_n + o_p(\gamma_n^{-1})$.

Remark. If the errors are iid and $\max_i |\xi_{i,n}| = o(1)$, then the distribution of $\sum_{i \le n} \xi_{i,n} u_{in}$ is asymptotically $N(0, \sigma^2 V_n^{-1})$.
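The linearization at the heart of the proof can be watched in action on a toy one-parameter model of our own choosing (not from the paper): compare the nonlinear LSE with the one-step solution $\tilde\theta_n = \theta_0 + (DD')^{-1}Du$, which for $p = 1$ is just a ratio of sums. All numerical choices below are illustrative.

```python
import math, random

random.seed(2)
n, theta0, sigma = 200, 1.0, 0.05
x = [random.random() for _ in range(n)]
u = [random.gauss(0, sigma) for _ in range(n)]

def f(theta, xi):
    return math.exp(-theta * xi)   # a smooth one-parameter regression function

y = [f(theta0, xi) + ui for xi, ui in zip(x, u)]

# linearized ("one-step") estimator: theta0 + (D'D)^{-1} D'u, with D_i = df_i/dtheta at theta0
D = [-xi * math.exp(-theta0 * xi) for xi in x]
theta_lin = theta0 + sum(di * ui for di, ui in zip(D, u)) / sum(di * di for di in D)

# nonlinear LSE by a fine grid search near theta0
def rss(theta):
    return sum((yi - f(theta, xi)) ** 2 for yi, xi in zip(y, x))

grid = [theta0 - 0.5 + k * 1e-3 for k in range(1001)]
theta_hat = min(grid, key=rss)
```

With this sample size the two estimators agree to within a few grid steps, as the theorem predicts.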

4. ANALYSIS OF MODEL (3): WU'S EXAMPLE

The results in this section illustrate the work of our limit theorems in a particular case where Wu's method fails. We prove both consistency and a central limit theorem for the model (3) with $\theta_0 = (\lambda_0, 1/2)$. In fact, without loss of generality, $\lambda_0 = 1$. As before, let $l_n = \log n$. Remember $\theta = (\lambda, \mu)$, with $\lambda \in \mathbb{R}$ and $0 \le \mu \le C_\mu$ for a finite constant $C_\mu$ greater than $1/2$, which ensures that $\theta_0 = (1, 1/2)$ is an interior point of the parameter space. Taking $C_\mu = 1/2$ would complicate the central limit theorem only slightly.

The behavior of $\hat\theta_n$ is determined by the behavior of the function
$$G_n(\gamma) := \sum_{i \le n} i^{-1+\gamma} \quad \text{for } \gamma \le 1,$$
or its standardized version
$$g_n(\beta) := G_n(\beta/l_n)/G_n(0) = \sum_{i \le n} \big( i^{-1}/G_n(0) \big) \exp(\beta l_i/l_n),$$
which is the moment generating function of the probability distribution that puts mass $i^{-1}/G_n(0)$ at $l_i/l_n$, for $i = 1, \ldots, n$. For large $n$, the function $g_n$ is well approximated by the increasing, nonnegative function
$$g(\beta) = \begin{cases} (e^\beta - 1)/\beta & \text{for } \beta \neq 0, \\ 1 & \text{for } \beta = 0, \end{cases}$$
the moment generating function of the uniform distribution on $(0, 1)$. More precisely, comparison of the sum with the integral $\int_1^n x^{-1+\gamma}\,dx$ gives

(15) $G_n(\gamma) = l_n g(\gamma l_n) + r_n(\gamma)$ with $0 \le r_n(\gamma) \le 1$, for $\gamma \le 1$.

The distributions corresponding to both $g_n$ and $g$ are concentrated on $[0, 1]$. Both functions have the properties described in the following lemma.

Lemma 1. Suppose $h(\gamma) = P\exp(\gamma x)$, the moment generating function of a probability distribution concentrated on $[0, 1]$. Then
(i) $\log h$ is convex;

(ii) $h(\gamma)^2/h(2\gamma)$ is unimodal: increasing for $\gamma < 0$, decreasing for $\gamma > 0$, achieving its maximum value of 1 at $\gamma = 0$;
(iii) $h'(\gamma) \le h(\gamma)$.

Proof. Assertion (i) is just the well known fact that the logarithm of a moment generating function is convex. Thus $h'/h$, the derivative of $\log h$, is an increasing function, which implies (ii) because
$$\frac{d}{d\gamma} \log\frac{h(\gamma)^2}{h(2\gamma)} = 2\,\frac{h'(\gamma)}{h(\gamma)} - 2\,\frac{h'(2\gamma)}{h(2\gamma)}.$$
Property (iii) comes from the representation $h'(\gamma) = P\big( x e^{\gamma x} \big)$, with $x \le 1$.

Remark. Direct calculation shows that $g(\gamma)^2/g(2\gamma)$ is a symmetric function of $\gamma$.

Reparametrize by putting $\beta = (1 - 2\mu) l_n$, with $(1 - 2C_\mu) l_n \le \beta \le l_n$, and $\alpha = \lambda \sqrt{G_n(\beta/l_n)}$. Notice that $|f(\theta)| = |\alpha|$ and that $\theta_0$ corresponds to $\alpha_0 = \sqrt{G_n(0)} \approx \sqrt{l_n}$ and $\beta_0 = 0$. Also
$$f_i(\theta) = \alpha\, \nu_i(\beta/l_n) \quad \text{where } \nu_i(\gamma) := i^{-1/2} \exp(\gamma l_i/2) \big/ \sqrt{G_n(\gamma)},$$
and

(16) $\kappa_n(\theta)^2 = G_n(0)\big( \lambda^2 g_n(\beta) - 2\lambda g_n(\beta/2) + 1 \big).$

We define $\bar\nu_i := \sup_{\gamma \le 1} \nu_i(\gamma)$.

Lemma 2. For all $(\alpha, \beta)$ corresponding to $\theta = (\lambda, \mu) \in \mathbb{R} \times [0, C_\mu]$:
(i) $\kappa_n(\theta) - \sqrt{G_n(0)} \le |\alpha| \le \kappa_n(\theta) + \sqrt{G_n(0)}$;
(ii) $\sum_{i \le n} \bar\nu_i^2 = O(\log\log n)$;
(iii) $|d\nu_i(\beta/l_n)/d\beta| \le \tfrac{1}{2}\nu_i(\beta/l_n)$;
(iv) $|f_i(\alpha_1, \beta_1) - f_i(\alpha_2, \beta_2)| \le \big( |\alpha_1 - \alpha_2| + \tfrac{1}{2}|\alpha_2|\,|\beta_1 - \beta_2| \big) \bar\nu_i$;
(v) $|f_i(\theta) - f_i(\theta_0)| \le i^{-1/2} + |\alpha|\,\bar\nu_i$.

Proof. Inequalities (i) and (v) follow from the triangle inequality.

For inequality (ii), first note that $\bar\nu_1^2 \le 1$. For $i \ge 2$, separate out contributions from three ranges:
$$\bar\nu_i^2 = \max\Big( \sup_{1/l_n \le \gamma \le 1} \nu_i(\gamma)^2,\; \sup_{|\gamma| < 1/l_n} \nu_i(\gamma)^2,\; \sup_{\gamma \le -1/l_n} \nu_i(\gamma)^2 \Big).$$
For $\gamma \ge 1/l_n$, invoke (15) to get a tractable upper bound:
$$\nu_i(\gamma)^2 \le \frac{i^{-1}\exp(\gamma l_i)}{l_n g(\gamma l_n)} \le \frac{i^{-1}\gamma \exp(\gamma l_i)}{\exp(\gamma l_n)(1 - e^{-1})} = \frac{i^{-1}\exp\big( \log\gamma + \gamma\log(i/n) \big)}{1 - e^{-1}}.$$
The last expression achieves its maximum over $[1/l_n, 1]$ at
$$\gamma_0 := \begin{cases} 1/\log(n/i) & \text{if } 1 \le i \le n/e, \\ 1 & \text{if } n/e \le i \le n, \end{cases}$$
which gives

(17) $\sup_{1/l_n \le \gamma \le 1} \nu_i(\gamma)^2 \le \dfrac{e^{-1}}{1 - e^{-1}}\, n^{-1} H(i/n)$ for $i \le n/e$, where $H(x) := 1\big/\big( x\log(1/x) \big)$.

Similarly, if $-1 < \gamma l_n < 1$,
$$\nu_i(\gamma)^2 \le \frac{\exp(\gamma l_i)}{i\, l_n g(\gamma l_n)} \le \frac{\exp(l_i/l_n)}{i\, l_n g(-1)} \le \frac{e/g(-1)}{i\, l_n}.$$
The last term is smaller than a constant multiple of the bound from (17). Finally, if $\gamma = -\delta \le -1/l_n$ and $i \ge 2$, then
$$\nu_i(\gamma)^2 \le \frac{i^{-1}\exp(-\delta l_i)\,\delta l_n}{l_n\big( 1 - \exp(-\delta l_n) \big)} \le \frac{i^{-1}\exp(\log\delta - \delta l_i)}{1 - e^{-1}} \le \frac{e^{-1}/(1 - e^{-1})}{i\, l_i}.$$
In summary, for some universal constant $C$,
$$\bar\nu_i^2 \le C \max\Big( n^{-1} H(i/n)\,\mathbf{1}\{i \le n/e\},\; \frac{1}{i\log i} \Big) \quad \text{for } 2 \le i \le n.$$
Bounding sums by integrals, we thus have
$$C^{-1} \sum_{i=2}^n \bar\nu_i^2 \le \int_{1/n}^{1/e} H(x)\,dx + \frac{H(1/e)}{n} + \int_2^n \frac{dx}{x\log x} = O(\log\log n).$$
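The integral comparison (15) invoked in the argument above is easy to check numerically. The following sketch (our illustration; the sample size and grid of $\gamma$ values are arbitrary) confirms that the remainder $r_n(\gamma) = G_n(\gamma) - l_n g(\gamma l_n)$ stays in $[0, 1]$:

```python
import math

def G(n, gamma):
    # G_n(gamma) = sum_{i<=n} i^{gamma - 1}
    return sum(i ** (gamma - 1.0) for i in range(1, n + 1))

def g(beta):
    # moment generating function of Uniform(0,1)
    return (math.exp(beta) - 1.0) / beta if beta != 0 else 1.0

n = 1000
ln = math.log(n)
for gamma in (-0.5, -0.1, 0.01, 0.3, 0.9):
    r = G(n, gamma) - ln * g(gamma * ln)
    assert 0.0 <= r <= 1.0   # the remainder bound in (15)
```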

For (iii), note that
$$2\,\frac{d}{d\beta}\nu_i(\beta/l_n) = 2\,\frac{d}{d\beta}\,\frac{i^{-1/2}\exp\big( \tfrac{1}{2}\beta l_i/l_n \big)}{\sqrt{G_n(0) g_n(\beta)}} = \Big( \frac{l_i}{l_n} - \frac{g_n'(\beta)}{g_n(\beta)} \Big)\,\nu_i(\beta/l_n),$$
which is bounded in absolute value by $\nu_i(\beta/l_n)$ because $0 \le g_n'(\beta) \le g_n(\beta)$ and $0 \le l_i/l_n \le 1$.

For (iv),
$$|f_i(\alpha_1, \beta_1) - f_i(\alpha_2, \beta_2)| \le |\alpha_1 - \alpha_2|\,\nu_i(\beta_1/l_n) + |\alpha_2|\,\big| \nu_i(\beta_1/l_n) - \nu_i(\beta_2/l_n) \big| \le |\alpha_1 - \alpha_2|\,\bar\nu_i + |\alpha_2|\,|\beta_1 - \beta_2|\,\tfrac{1}{2}\bar\nu_i,$$
the bound for the second term coming from the mean-value theorem and (iii).

Lemma 3. For $\epsilon > 0$, let $N_\epsilon = \{\theta : \max(|\lambda - 1|, |\beta|) \le \epsilon\}$. If $\epsilon$ is small enough, there exists a constant $C_\epsilon > 0$ such that $\inf\{\kappa_n(\theta) : \theta \notin N_\epsilon\} \ge C_\epsilon \sqrt{l_n}$ when $n$ is large enough.

Proof. Suppose $|\beta| \ge \epsilon$. Remember that $G_n(0) \approx l_n$. Minimize over $\lambda$ the lower bound (16) for $\kappa_n(\theta)^2$ by choosing $\lambda = g_n(\beta/2)/g_n(\beta)$, then invoke Lemma 1(ii):
$$\frac{\kappa_n(\theta)^2}{l_n} \gtrsim 1 - \frac{g_n(\beta/2)^2}{g_n(\beta)} \ge 1 - \max\Big( \frac{g_n(\epsilon/2)^2}{g_n(\epsilon)},\; \frac{g_n(-\epsilon/2)^2}{g_n(-\epsilon)} \Big) \to 1 - \frac{g(\epsilon/2)^2}{g(\epsilon)} > 0.$$
If $|\beta| \le \epsilon$ and $\epsilon$ is small enough to make $(1 - \epsilon)e^{\epsilon/2} < 1 < (1 + \epsilon)e^{-\epsilon/2}$, use
$$\kappa_n(\theta)^2 = \sum_{i \le n} i^{-1}\big( \lambda\exp(\beta l_i/2 l_n) - 1 \big)^2.$$
If $\lambda \ge 1 + \epsilon$, bound each summand from below by $i^{-1}\big( (1 + \epsilon)e^{-\epsilon/2} - 1 \big)^2$. If $\lambda \le 1 - \epsilon$, bound each summand from below by $i^{-1}\big( 1 - (1 - \epsilon)e^{\epsilon/2} \big)^2$.

4.1. Consistency. On the annulus $S_R := \{R \le \kappa_n(\theta) < 2R\}$ we have, with $K_R := 2R + \sqrt{G_n(0)}$,
$$|f_i(\theta_1) - f_i(\theta_2)| \le K_R \bar\nu_i\, d_R(\theta_1, \theta_2), \quad \text{where } d_R(\theta_1, \theta_2) := |\alpha_1 - \alpha_2|/K_R + \tfrac{1}{2}|\beta_1 - \beta_2|,$$
$$|f_i(\theta) - f_i(\theta_0)| \le b_i := i^{-1/2} + K_R \bar\nu_i.$$
Note that
$$\sum_{i \le n} \big( i^{-1/2} + K_R \bar\nu_i \big)^2 = O\big( l_n + K_R^2 \log l_n \big) = O(K_R^2 L_n), \quad \text{where } L_n := \log\log n.$$

The rectangle $\{|\alpha| \le K_R,\; |\beta| \le c\, l_n\}$ can be partitioned into $O(y^{-1}\, l_n/y)$ subrectangles of $d_R$-diameter at most $y$. Thus $N(y, S_R, d_R) \le C_0 l_n/y^2$ for a constant $C_0$ that depends only on $C_\mu$, which gives
$$\int_0^1 \rho N(y, S_R, d_R)\,dy = O\big( \sqrt{L_n} \big).$$
Apply Theorem 3 with $\delta = 1$ to conclude that
$$P\{\hat\theta_n \in S_R\} \le C_1 K_R^2 L_n^2 / R^4 \le C_2\big( R^2 + l_n \big) L_n^2 / R^4.$$
Put $R = R_k := C_3 2^k (l_n L_n^2)^{1/4}$, then sum over $k$ to deduce that $P\{\kappa_n(\hat\theta_n) \ge C_3 (l_n L_n^2)^{1/4}\} \le \epsilon$ eventually, if the constant $C_3$ is large enough. That is, $\kappa_n(\hat\theta_n) = O_p\big( (l_n L_n^2)^{1/4} \big)$ and, via Lemma 3, $\hat\lambda_n - 1 = o_p(1)$ and $2 l_n (\hat\mu_n - \mu_0) = -\hat\beta_n = o_p(1)$.

4.2. Central limit theorem. This time work with the $(\lambda, \beta)$ reparametrization, with
$$D_i(\lambda, \beta) = \Big( \frac{\partial f_i(\lambda, \beta)}{\partial\lambda},\; \frac{\partial f_i(\lambda, \beta)}{\partial\beta} \Big) = f_i(\lambda, \beta)\,\big( 1/\lambda,\; l_i/2 l_n \big), \quad \text{where } f_i(\lambda, \beta) = \lambda\, i^{-1/2 + \beta/2 l_n},$$
and $\theta_0 = (\lambda_0, \beta_0) = (1, 0)$. Take $d$ as the usual two-dimensional Euclidean distance in the $(\lambda, \beta)$ space. For simplicity of notation, we omit some $n$ subscripts, even though the relationship between $\theta$ and $(\lambda, \beta)$ changes with $n$. We have just shown that the LSE $(\hat\lambda_n, \hat\beta_n)$ is consistent.

Comparison of sums with analogous integrals gives the approximations

(18) $\sum_{i \le n} i^{-1} l_i^p = l_n^{p+1}/(p + 1) + r_p$ with $|r_p| \le 1$, for $p = 0, 1, 2, \ldots$

In consequence,
$$\gamma_n^2 = \sum_{i \le n} |D_i(\lambda_0, \beta_0)|^2 = \sum_{i \le n} i^{-1}\big( 1 + l_i^2/4 l_n^2 \big) = \tfrac{13}{12}\, l_n + O(1)$$
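The constant $\gamma_n^2 \approx \tfrac{13}{12} l_n$ and the limiting entries $(12/13, 3/13, 1/13)$ of $V_n$ can be double-checked numerically. The sketch below is our own sanity check, with an arbitrary choice of $n$:

```python
import math

n = 10000
ln = math.log(n)
li = [math.log(i) for i in range(1, n + 1)]

# gamma_n^2 = sum_i i^{-1} (1 + l_i^2/(4 l_n^2)), expected to be (13/12) l_n + O(1)
gamma2 = sum((1 + li[i - 1] ** 2 / (4 * ln * ln)) / i for i in range(1, n + 1))

# entries of V_n = gamma_n^{-2} sum_i i^{-1} (1, l_i/2l_n)'(1, l_i/2l_n),
# which should approach V = (1/13) [[12, 3], [3, 1]]
v11 = sum(1 / i for i in range(1, n + 1)) / gamma2
v12 = sum(li[i - 1] / (2 * ln * i) for i in range(1, n + 1)) / gamma2
v22 = sum(li[i - 1] ** 2 / (4 * ln * ln * i) for i in range(1, n + 1)) / gamma2
```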

and
$$V_n = \gamma_n^{-2} \sum_{i \le n} i^{-1} \begin{pmatrix} 1 \\ l_i/2 l_n \end{pmatrix} \big( 1,\; l_i/2 l_n \big) = V + O(1/l_n), \quad \text{where } V = \frac{1}{13}\begin{pmatrix} 12 & 3 \\ 3 & 1 \end{pmatrix}.$$
The smaller eigenvalue of $V_n$ converges to the smaller eigenvalue of the positive definite matrix $V$, which is strictly positive. Within the neighborhood $N_\epsilon := \{\max(|\lambda - 1|, |\beta|) \le \epsilon\}$, for a fixed $\epsilon \le 1/2$, both $|f_i(\lambda, \beta)|$ and $|D_i(\lambda, \beta)|$ are bounded by a multiple of $i^{-1/2}$. Thus
$$|D_i(\theta_1) - D_i(\theta_2)| \le |\lambda_1^{-1} - \lambda_2^{-1}|\,|f_i(\theta_1)| + 3\,|f_i(\theta_1) - f_i(\theta_2)| \le C_\epsilon\, i^{-1/2}\, d(\theta_1, \theta_2).$$
That is, we may take $M_i$ as a multiple of $i^{-1/2}$, which gives $\sum_{i \le n} M_i^2 = O(l_n)$. All the conditions of Theorem 4 are satisfied. We have
$$\sqrt{l_n}\,\big( \hat\lambda_n - 1,\; \hat\beta_n \big) = \tfrac{12}{13} \sum_{i \le n} u_i\, i^{-1/2}\, l_n^{-1/2}\, \big( 1,\; l_i/2 l_n \big)\, V^{-1} + o_p(1).$$

REFERENCES

Chernoff, H. (1954). On the distribution of the likelihood ratio. Annals of Mathematical Statistics 25, 573-578.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics 40, 633-643.
Lai, T. L. (1994). Asymptotic properties of nonlinear least-squares estimates in stochastic regression models. Annals of Statistics 22, 1917-1930.
Pollard, D. (1989). Asymptotics via empirical processes (with discussion). Statistical Science 4, 341-366.
Skouras, K. (2000). Strong consistency in nonlinear stochastic regression models. Annals of Statistics 28, 871-879.
Van de Geer, S. (1990). Estimating a regression function. Annals of Statistics 18, 907-924.

Van de Geer, S. and M. Wegkamp (1996). Consistency for the least squares estimator in nonparametric regression. Annals of Statistics 24, 2513-2523.
van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag.
Wu, C.-F. (1981). Asymptotic theory of nonlinear least squares estimation. Annals of Statistics 9, 501-513.

STATISTICS DEPARTMENT, YALE UNIVERSITY, BOX 208290 YALE STATION, NEW HAVEN, CT 06520-8290.
E-mail address: david.pollard@yale.edu; peter.radchenko@yale.edu
URL: http://www.stat.yale.edu/~pollard/; http://pantheon.yale.edu/~pvr4