ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum, Weierstrass Institute, Berlin


The Annals of Statistics 1996, Vol. 24, No. 6, 2399-2430

Signal recovery in Gaussian white noise with variance tending to zero has served for some time as a representative model for nonparametric curve estimation, having all the essential traits in a pure form. The equivalence has mostly been stated informally, but an approximation in the sense of Le Cam's deficiency distance Δ would make it precise. The models are then asymptotically equivalent for all purposes of statistical decision with bounded loss. In nonparametrics, a first result of this kind has recently been established for Gaussian regression. We consider the analogous problem for the experiment given by n i.i.d. observations having density f on the unit interval. Our basic result concerns the parameter space of densities which are in a Hölder ball with exponent α > 1/2 and which are uniformly bounded away from zero. We show that an i.i.d. sample of size n with density f is globally asymptotically equivalent to a white noise experiment with drift f^{1/2} and variance (4n)^{-1}. This represents a nonparametric analog of Le Cam's heteroscedastic Gaussian approximation in the finite dimensional case. The proof utilizes empirical process techniques related to the Hungarian construction. White noise models with drift f and log f are also considered, allowing for various automatic asymptotic risk bounds in the i.i.d. model from white noise.

1. Introduction and main result. One of the basic principles of Le Cam's (1986) asymptotic decision theory is to approximate general experiments by simple ones. In particular, weak convergence to Gaussian shift experiments has now become a standard tool for establishing asymptotic risk bounds. The risk bounds implied by weak convergence are generally estimates from below, and in most of the literature the efficiency of procedures is more or less shown on an ad hoc basis. However, a systematic approach to the attainment problem is also made possible by Le Cam's theory, based on the notion of strong convergence of experiments, which means proximity in the sense of the full deficiency distance. However, due to the inherent technical difficulties of handling the deficiency concept, this possibility is rarely used, even in root-n consistent parametric problems. In nonparametric curve estimation models of the ill-posed class where there is no root-n consistency, research has focused for a long time on optimal rates of convergence. In these problems, limits of experiments for n^{-1/2}-localized parameters are not directly useful for risk bounds. But now a theory of exact asymptotic risk constants is also developing in the context of slower rates of convergence.

Received March 1993; revised April 1996.
AMS 1991 subject classifications. Primary 62G07; secondary 62B15, 62G20.
Key words and phrases. Nonparametric experiments, deficiency distance, likelihood process, Hungarian construction, asymptotic minimax risk, curve estimation.
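To make the two models of the abstract concrete, here is a minimal numerical sketch (not from the paper; the density f, grid size and seed are my own illustrative choices) that generates one realization of each experiment: n i.i.d. draws from a density f on [0,1], and a discretized path of the white noise model with drift f^{1/2} and variance (4n)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                   # sample size
grid = np.linspace(0.0, 1.0, 1001)           # discretization of [0,1]
dt = grid[1] - grid[0]

f = 1.0 + 0.3 * np.cos(2 * np.pi * grid)     # a density bounded away from 0

# Experiment (3): n i.i.d. draws from f, via inversion of the CDF.
F = np.cumsum(f) * dt
F /= F[-1]
y = np.interp(rng.uniform(size=n), F, grid)

# Experiment (4): dy(t) = f^{1/2}(t) dt + (1/2) n^{-1/2} dW(t),
# observed here as a discretized path of the integrated process.
dW = rng.normal(scale=np.sqrt(dt), size=grid.size - 1)
y_path = np.cumsum(np.sqrt(f[:-1]) * dt + 0.5 * n**-0.5 * dW)
print(y[:3], y_path[-1])
```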

Such an exact risk bound was first discovered by Pinsker (1980) in the problem of signal recovery in Gaussian white noise, which is by now recognized as the basic or typical nonparametric curve estimation problem. The cognitive value of this model had already been realized by Ibragimov and Khasminski (1977). These risk bounds have been established since then in a variety of other problems, for example, density, nonparametric regression and spectral density [see Efroimovich and Pinsker (1982), Golubev (1984) and Nussbaum (1985)], and they have also been substantially extended conceptually [Korostelev (1993) and Donoho and Johnstone (1992)]. The theory is now at a stage where the approximation of the various particular curve estimation problems by the white noise model could be made formal. An important step in this direction has been made by Brown and Low (1996) by relating Gaussian regression to the signal recovery problem. These models are essentially the continuous and discrete versions of each other. The aim of this paper is to establish the formal approximation by the white noise model for the problem of density estimation from an i.i.d. sample.

To formulate our main result, define a basic parameter space of densities as follows. For α ∈ (0, 1] and M > 0, let

Σ(α, M) = { f : |f(x) − f(y)| ≤ M|x − y|^α, x, y ∈ [0, 1] }

be a Hölder ball of functions with exponent α. For ε > 0 define a set D_ε as the set of densities on [0, 1] bounded below by ε:

(1) D_ε = { f : ∫₀¹ f = 1, f(x) ≥ ε, x ∈ [0, 1] }.

Define an a priori set, for given α ∈ (0, 1], M > 0, ε > 0,

(2) Σ(α, M, ε) = Σ(α, M) ∩ D_ε.

Let Δ be Le Cam's deficiency pseudodistance between experiments having the same parameter space. For the convenience of the reader a formal definition is given in Section 9. For two sequences of experiments E_n and F_n we shall say that they are asymptotically equivalent if Δ(E_n, F_n) → 0 as n → ∞. Let dW denote the standard Gaussian white noise process on the unit interval.

Theorem 1.1. Let Σ be a set of densities contained in Σ(α, M, ε) for some ε > 0, M > 0 and α > 1/2. Then the experiments given by observations

(3) y_i, i = 1, …, n, i.i.d. with density f,

(4) dy(t) = f^{1/2}(t) dt + ½ n^{-1/2} dW(t), t ∈ [0, 1],

with f ∈ Σ are asymptotically equivalent.
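The parameter space Σ(α, M, ε) is cut out by three explicit constraints, so membership can be checked numerically. The helper below is my own illustration (a grid proxy for the Hölder condition, with tolerance and test density chosen arbitrarily), meant only to make the definition tangible.

```python
import numpy as np

def in_parameter_space(f_vals, grid, alpha, M, eps):
    """Grid proxy for f ∈ Σ(α, M) ∩ D_ε: Hölder bound, ∫f = 1, f ≥ ε."""
    dt = grid[1] - grid[0]
    integral = np.sum((f_vals[1:] + f_vals[:-1]) / 2) * dt   # trapezoid rule
    if abs(integral - 1.0) > 1e-3 or np.min(f_vals) < eps:
        return False
    # |f(x) - f(y)| <= M |x - y|^alpha over all grid pairs
    dx = np.abs(grid[:, None] - grid[None, :])
    df = np.abs(f_vals[:, None] - f_vals[None, :])
    np.fill_diagonal(dx, 1.0)        # dummy value; diagonal differences are 0
    return bool(np.all(df <= M * dx**alpha))

grid = np.linspace(0.0, 1.0, 201)
f = 1.0 + 0.3 * np.cos(2 * np.pi * grid)
print(in_parameter_space(f, grid, alpha=0.6, M=3.0, eps=0.5))   # True
```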

This result is closely related to Le Cam's global asymptotic normality for parametric models. In the i.i.d. model let f be in a parametric family (f_ϑ : ϑ ∈ Θ), where Θ ⊂ R^k, which is sufficiently regular and has Fisher information matrix I(ϑ) at the point ϑ. Then the i.i.d. model may be approximated by a heteroscedastic Gaussian experiment

(5) y = ϑ + n^{-1/2} I(ϑ)^{-1/2} η,

where η is a standard normal vector and ϑ ∈ Θ. We see that (4) is a nonparametric analog of (5) when ϑ is identified with f^{1/2}. Indeed, consider the identity for the Fisher information matrix in the parametric case

‖f_{ϑ'}^{1/2} − f_ϑ^{1/2}‖₂ = ½ ‖I(ϑ)^{1/2}(ϑ' − ϑ)‖ + o(‖ϑ' − ϑ‖).

Formally regarding f^{1/2} itself as a parameter, we find the corresponding Fisher information to be 4 times the unit operator. However, even for parametric families (4) seems to be an interesting form of a global approximation: if f_ϑ^{1/2} is taken as parameter, then the resulting Gaussian model has a simple form. One recognizes that the heteroscedastic nature of (5) derives only from the curved nature of a general parametric family within the space of roots of densities. This observation was in fact made earlier by Le Cam (1985). In his Theorem 4.3 he established the homoscedastic global Gaussian approximation for i.i.d. models in the finite dimensional case. We give a paraphrase of that result in a specialized form. A set Θ in L₂(0, 1) is said to have finite metric dimension if there is a number D such that every subset of Θ which can be covered by an ε-ball can be covered by no more than 2^D balls of radius ε/2, where D does not depend on ε. A set of densities f has this property in Hellinger metric if the corresponding set of f^{1/2} has it in L₂(0, 1).

Proposition 1.2 [Le Cam (1985)]. Let Σ be a set of densities on [0, 1] having finite dimension in Hellinger metric and fulfilling a further regularity condition (see Section 10). Then the experiments given by observations (3) and (4) with f ∈ Σ are asymptotically equivalent.

The actual formulation in Le Cam (1985) is more abstract and general, giving a global asymptotic normality in the i.i.d. case for arbitrary random variables, in particular without assumed existence of densities; but finite dimensionality is essential. This result, in its conceptual clarity and potential impact, seems not to have been well appreciated by researchers; the heteroscedastic form (5) under classical regularity conditions is somewhat better known [cf. Mammen (1986)]. Our main result can thus be viewed as an extension of Le Cam's Proposition 1.2 to a nonparametric setting. The value 1/2 of the Hölder exponent α seems to be a critical one; for discretization of Gaussian white noise, this has been shown by Brown and Low (1996). White noise models with fixed variance do occur as local limits of experiments in root-n consistent nonparametric problems [Millar (1979)], and, via specific renormalizations, also in non-root-n consistent curve estimation [Low (1992) and Donoho and Low (1992)].

Thus various central limit theorems for i.i.d. experiments can be embedded in a relatively simple and closed-form approximation by (4). Moreover, for the density f itself and for log f we also give Gaussian approximations which are heteroscedastic in analogy to (5); see Remark 2.8 and Corollary 3.3.

The paper is organized as follows. The basic results are developed in an overview fashion in Sections 2 and 3, which may suffice for a first reading. By default, proofs or technical comments for all statements are to be found in Sections 4-10. In Section 2 we develop the basic approximation of likelihood ratios over shrinking neighborhoods of a given density f₀. These neighborhoods Σ_n(f₀) are already nonparametric, in the sense of shrinking slower than n^{-1/2}. For proving this, we partition the sample space [0, 1] into small intervals and obtain a product experiment structure via Poissonization. The Gaussian approximation is then argued via the space-local empirical process on the small intervals; piecing this together on [0, 1] yields the basic parameter-local Gaussian approximation over f ∈ Σ_n(f₀). Once in a Gaussian framework, we manipulate likelihood ratios to obtain other approximations, in particular the one with trend f^{1/2}. For these experiments, which are all Gaussian, we use the methodology of Brown and Low (1996), who compared the white noise model with its discrete version (the Gaussian regression model). It remains to piece together the parameter-local approximations using a preliminary estimator; this is the subject of Section 3. Our method of globalization is somewhat different from Le Cam's, which works in the parametric case; the concept of metric entropy or dimension and related theory are not utilized. But obviously these methods, which already have proved fruitful in nonparametrics, have a potential application here as well.

Our results are nonconstructive in spirit; that is, they are estimates of the Δ-distance which imply asymptotic risk bounds. The question of a constructive recipe for procedures, as obtained by Brown and Low (1996), is more complex to treat in the present case. Nevertheless, application to the theory of asymptotic minimax constants in nonparametrics is possible. We do not develop this here, but a possible first exercise would be to derive the result of Efroimovich and Pinsker (1982) on density estimation with L₂-loss over Sobolev classes from Pinsker's (1980) result in the white noise model. Since the deficiency distance refers to loss functions which are uniformly bounded, one would use the version of Pinsker's result for bounded L₂-related loss [cf. Tsybakov (1994)]. For a recent account of exact constants for L₂-loss over ellipsoids see Belitser and Levit (1995). This application is limited in scope by the fact that our critical smoothness 1/2 is for Hölder classes, so that for Sobolev classes it comes out as 1. For Gaussian nonparametric regression [Brown and Low (1996)], the limit is actually 1/2 in the Sobolev (or Besov B^{1/2}_2) sense, but it is not clear if this holds in the i.i.d. model. Another result which can be carried over to density estimation is the L_∞-analog of Pinsker's constant, found by Korostelev (1993). The details for this case, where the Hölder classes are natural, are developed in Korostelev and Nussbaum (1996).

As a basic text for the asymptotic theory of experiments we refer to Strasser (1985). We use C as a generic notation for positive constants; for sequences the symbol a_n ≍ b_n means the usual equivalence in rate, while a_n ∼ b_n means a_n = b_n(1 + o(1)).

2. The local approximation. Our first Gaussian approximation will be established in a parameter-local framework. Suppose we have i.i.d. observations y_i, i = 1, …, n, with distribution P_f having Lebesgue density f on the interval [0, 1], and it is known a priori that f belongs to a set of densities Σ. Henceforth in the paper we will set Σ = Σ(α, M, ε) for some ε > 0, M > 0 and α > 1/2. Let ‖·‖_p denote the norm in the space L_p(0, 1), 1 ≤ p ≤ ∞. Let γ_n be the sequence

(6) γ_n = n^{-1/4} (log n)^{-1},

and for any f₀ ∈ Σ define a class Σ_n(f₀) by

(7) Σ_n(f₀) = { f ∈ Σ : ‖f/f₀ − 1‖_∞ ≤ γ_n }.

For given f₀ we define a local (around f₀) product experiment

(8) E_{0,n}(f₀) = ( [0,1]^n, B^n[0,1], ( P_f^n : f ∈ Σ_n(f₀) ) ).

Let F₀ be the distribution function corresponding to f₀, and let

K(f₀, f) = −∫₀¹ log(f/f₀) dF₀

be the Kullback-Leibler relative entropy. Let W be the standard Wiener process on [0, 1], and consider an observed process

(9) y(t) = ∫₀ᵗ log(f/f₀)(F₀^{-1}(u)) du + tK(f₀, f) + n^{-1/2} W(t), t ∈ [0, 1].

Let Q_{n,f,f₀} be the distribution of this process on the function space C[0, 1] equipped with its Borel σ-algebra B(C[0, 1]), and let

(10) E_{1,n}(f₀) = ( C[0, 1], B(C[0, 1]), ( Q_{n,f,f₀} : f ∈ Σ_n(f₀) ) )

be the corresponding experiment when f varies in the neighborhood Σ_n(f₀).

Theorem 2.1. Define Σ_n(f₀) as in (7) and (6). Then

Δ( E_{0,n}(f₀), E_{1,n}(f₀) ) → 0 as n → ∞,

uniformly over f₀ ∈ Σ.
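The locally approximating experiment (9) is easy to simulate once λ_{f,f₀} = log(f/f₀)(F₀^{-1}) and K(f₀, f) are computed. The sketch below is my own construction (with f₀ uniform, so that F₀^{-1} is the identity, and an arbitrary small perturbation f); it draws one path y(t).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]

f0 = np.ones_like(t)                          # center density: uniform on [0,1]
f = 1.0 + 0.05 * np.sin(2 * np.pi * t)        # a density in a small neighborhood

# With f0 uniform, F_0^{-1}(u) = u, so λ_{f,f0}(u) = log(f/f0)(u).
lam = np.log(f / f0)
K = -np.sum((lam[1:] + lam[:-1]) / 2) * dt    # K(f0,f) = -∫λ, the Kullback number

dW = rng.normal(scale=np.sqrt(dt), size=t.size - 1)
drift = np.concatenate([[0.0], np.cumsum(lam[:-1] * dt)])
y = drift + t * K + n**-0.5 * np.concatenate([[0.0], np.cumsum(dW)])
```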

The proof is based upon the following principle, described in Le Cam and Yang [(1990), page 16]. Consider two experiments E_i = (Ω_i, A_i, (P_{i,ϑ} : ϑ ∈ Θ)), i = 0, 1, having the same parameter set Θ. Assume there is some point ϑ₀ ∈ Θ such that all the P_{i,ϑ} are dominated by P_{i,ϑ₀}, i = 0, 1, and form Λ_i(ϑ) = dP_{i,ϑ}/dP_{i,ϑ₀}. Consider Λ_i = (Λ_i(ϑ) : ϑ ∈ Θ) as stochastic processes indexed by ϑ, given on the probability space (Ω_i, A_i, P_{i,ϑ₀}). By a slight abuse of language, we call these the likelihood processes of the experiments E_i (note that the distribution is taken under P_{i,ϑ₀} here). Suppose also that there are versions Λ̃_i of these likelihood processes defined on a common probability space (Ω, A, P).

Proposition 2.2. The deficiency distance Δ(E₀, E₁) satisfies

Δ(E₀, E₁) ≤ sup_{ϑ∈Θ} E_P |Λ̃₀(ϑ) − Λ̃₁(ϑ)|.

Proof. It is one of the basic facts of Le Cam's theory that, for dominated experiments, the equivalence class is determined by the distribution of the likelihood process under P_{i,ϑ₀} when ϑ₀ is assumed fixed. This means that in the above framework we have Δ(E₀, E₁) = 0 iff L(Λ₀ | P_{0,ϑ₀}) = L(Λ₁ | P_{1,ϑ₀}). Thus, if we construct an experiment Ẽ_i with likelihood process Λ̃_i, we obtain equivalence: Δ(E_i, Ẽ_i) = 0. The random variables Λ̃_i(ϑ) on (Ω, A, P) have the same distributions as Λ_i(ϑ) on (Ω_i, A_i, P_{i,ϑ₀}), for all ϑ ∈ Θ; hence they are positive and integrate to 1. They may hence be considered as P-densities on (Ω, A), indexed by ϑ. These densities define measures P̃_{i,ϑ} on (Ω, A) and experiments Ẽ_i = (Ω, A, (P̃_{i,ϑ} : ϑ ∈ Θ)), i = 0, 1. By construction, the likelihood process for Ẽ_i is Λ̃_i, so Δ(E_i, Ẽ_i) = 0, i = 0, 1. Hence Δ(E₀, E₁) = Δ(Ẽ₀, Ẽ₁), and Ẽ₀ and Ẽ₁ are given on the same measurable space (Ω, A). In this case, an upper bound for the deficiency distance is

Δ(Ẽ₀, Ẽ₁) ≤ sup_{ϑ∈Θ} ‖P̃_{0,ϑ} − P̃_{1,ϑ}‖,

where ‖·‖ is the total variation distance between measures [in (68), Section 9, take the identity map as a transition M]. But ‖P̃_{0,ϑ} − P̃_{1,ϑ}‖ coincides with E_P |Λ̃₀(ϑ) − Λ̃₁(ϑ)|, which is just an L₁-distance between densities.

The argument may be summarized as follows: versions Λ̃_i of the likelihood processes on a common probability space generate (equivalent) versions of the experiments on a common measurable space for which Λ̃_i(ϑ) are densities. Their L₁-distance bounds the deficiency. When Λ̃_i(ϑ) are considered as densities it is natural to employ also their Hellinger distance H; extending the notation, we write

(11) H²( Λ̃₀(ϑ), Λ̃₁(ϑ) ) = E_P ( Λ̃₀(ϑ)^{1/2} − Λ̃₁(ϑ)^{1/2} )².

Making use of the general relation of Hellinger distance to L₁-distance, we obtain

(12) Δ²(E₀, E₁) ≤ 4 sup_{ϑ∈Θ} H²( Λ̃₀(ϑ), Λ̃₁(ϑ) ).
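The passage from the L₁ bound to (12) rests on the elementary inequality ‖P − Q‖₁ ≤ 2H(P, Q). A quick numerical check on two explicit densities (my own example) illustrates it:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_001)
dx = x[1] - x[0]
trap = lambda v: float(np.sum((v[1:] + v[:-1]) / 2) * dx)   # trapezoid rule

p = np.ones_like(x)                          # uniform density
q = 1.0 + 0.2 * np.cos(2 * np.pi * x)        # a nearby density

l1 = trap(np.abs(p - q))                     # ||P - Q||_1
H = np.sqrt(trap((np.sqrt(p) - np.sqrt(q)) ** 2))
print(l1, 2 * H, l1 <= 2 * H)                # 0.127... <= 0.141...
```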

In the sequel we will work basically with relation (12) to establish asymptotic equivalence. For our problem, we identify ϑ = f, ϑ₀ = f₀, Θ = Σ_n(f₀), P_{0,ϑ} = P_f^n and P_{1,ϑ} = Q_{n,f,f₀}. Furthermore, we represent the observations y_i as y_i = F^{-1}(z_i), where z_i are i.i.d. uniform(0, 1) random variables and F is the distribution function for the density f (note that F is strictly monotone for f ∈ Σ). Let U_n be the empirical process of z₁, …, z_n, that is,

U_n(t) = n^{-1/2} ∑_{i=1}^n ( χ_{[0,t]}(z_i) − t ), t ∈ [0, 1].

Note that E_{0,n}(f₀) is dominated by P_{f₀}^n; the likelihood process is then

Λ_{0,n}(f, f₀) = exp{ ∑_{i=1}^n log(f/f₀)(F₀^{-1}(z_i)) }.

Defining

(13) λ_{f,f₀}(t) = log(f/f₀)(F₀^{-1}(t))

and observing that

∫₀¹ λ_{f,f₀}(t) dt = −K(f₀, f),

we then have the following representation:

(14) Λ_{0,n}(f, f₀) = exp{ n^{1/2} ∫₀¹ λ_{f,f₀}(t) dU_n(t) − nK(f₀, f) }.

This suggests a corresponding Gaussian likelihood process: substitute U_n by a Brownian bridge B and renormalize to obtain integral 1. We thus form, for a uniform(0, 1) random variable Z,

(15) Λ_{1,n}(f, f₀) = exp{ n^{1/2} ∫₀¹ λ_{f,f₀}(t) dB(t) − (n/2) Var λ_{f,f₀}(Z) }.

For an appropriate standard Wiener process W we have

∫ λ_{f,f₀}(t) dB(t) = ∫ ( λ_{f,f₀}(t) + K(f₀, f) ) dW(t).

By rewriting the likelihood process Λ_{1,n}(f, f₀) accordingly, we see that it corresponds to observations (9) or, equivalently, to

(16) dy(t) = ( λ_{f,f₀}(t) + K(f₀, f) ) dt + n^{-1/2} dW(t), t ∈ [0, 1],

at least when the parameter space is Σ_n(f₀). Thus Λ_{1,n}(f, f₀) is in fact the likelihood process for E_{1,n}(f₀) in (10). To find nearby versions of these likelihood processes, fulfilling

(17) sup_{f∈Σ_n(f₀)} H²( Λ̃_{0,n}(f, f₀), Λ̃_{1,n}(f, f₀) ) → 0,

it would be natural to look for versions of U_n and B on a common probability space (U_n and B_n, say) which are close, such as in the classical Hungarian construction [see Shorack and Wellner (1986), Chapter 12, Section 1, Theorem 2].
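Representation (14) is an exact algebraic identity, ∑_i λ(z_i) = n^{1/2} U_n(λ) + n ∫λ, which can be verified directly in code (a sketch with my own toy f and f₀; the uniform center makes F₀^{-1} the identity):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
z = rng.uniform(size=n)                       # i.i.d. uniform(0,1) variables

t = np.linspace(0.0, 1.0, 10_001)
dt = t[1] - t[0]
f0 = np.ones_like(t)                          # uniform center, F_0^{-1}(u) = u
f = 1.0 + 0.1 * np.sin(2 * np.pi * t)
lam_vals = np.log(f / f0)
lam = lambda u: np.interp(u, t, lam_vals)

int_lam = np.sum((lam_vals[1:] + lam_vals[:-1]) / 2) * dt   # ∫λ = -K(f0,f)
U_n = np.sqrt(n) * (np.mean(lam(z)) - int_lam)              # U_n(λ)

lhs = np.sum(lam(z))                          # log Λ_{0,n}(f,f0) computed directly
rhs = np.sqrt(n) * U_n + n * int_lam          # n^{1/2} U_n(λ) - n K(f0,f)
print(np.isclose(lhs, rhs))                   # True: (14) is an identity
```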

However, the classical Hungarian construction [the Komlós-Major-Tusnády (KMT) inequality] gives an estimate of the uniform distance ‖U_n − B_n‖_∞ which for our purpose is not optimal. The reason is that the uniform distance may be construed as

‖U_n − B_n‖_∞ = sup_{g∈G} |U_n(g) − B_n(g)|,

where G is a class of indicators of subintervals of [0, 1]. Considering more general classes of functions G leads to functional KMT type results [see Koltchinskii (1994) and Rio (1994)]. However, for an estimate (17) we need to control the random difference |U_n(g) − B_n(g)| only for one given function (λ_{f,f₀} in this case), with a supremum over a function class only after taking expectations [cf. the remark of Le Cam and Yang (1990), page 16]. Thus for our purpose we ought to use a functional KMT type inequality for a one-element function class G = {g}, but where the same constants and one Brownian bridge are still available over a class of smooth g. Such a result is provided by Koltchinskii [(1994), Theorem 3.5]. We present a version slightly adapted for our purpose. Let L₂(0, 1) be the space of all square integrable measurable functions on [0, 1], and let ‖·‖_{H^{1/2}} be the seminorm associated with a Hölder condition with exponent 1/2 in the L₂-sense (see Section 5 for details).

Proposition 2.3. There are a probability space (Ω, A, P) and a number C such that for all n there are versions of the uniform empirical process U_n(g) and of the Brownian bridge B_n(g), g ∈ L₂(0, 1), such that for all g with ‖g‖_∞ < ∞, ‖g‖_{H^{1/2}} < ∞ and for all t ≥ 0,

P( n^{1/2} |U_n(g) − B_n(g)| ≥ C ( ‖g‖_∞ + ‖g‖_{H^{1/2}} ) ( t + log n ) log^{1/2} n ) ≤ C exp(−t).

Specializing g = λ_{f,f₀} − ∫₀¹ λ_{f,f₀}, we come close to establishing relation (17) for the likelihood processes, but we need an assumption that the neighborhoods Σ_n(f₀) shrink with rate o(n^{-1/3}). Comparing this with the usual nonparametric rates of convergence, we see that such a result is useful only for smoothness α > 1. To treat the case α > 1/2, however, we need neighborhoods of size o(n^{-1/4}). To obtain such a result, it is convenient, rather than using the Hungarian construction globally on [0, 1], to subdivide the interval and use a corresponding independence structure (approximate or exact) of both experiments. In this connection the following result is useful [see Strasser (1985), Lemma 2.19].

Lemma 2.4. Suppose that P_i and Q_i are probability measures on a measurable space (Ω_i, A_i), for i = 1, …, k. Then

H²( ⊗_{i=1}^k P_i, ⊗_{i=1}^k Q_i ) ≤ ∑_{i=1}^k H²( P_i, Q_i ).
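Lemma 2.4 can be sanity-checked in closed form for products of unit-variance Gaussians, where H²(N(a,1), N(b,1)) = 2(1 − e^{−(a−b)²/8}) and the Hellinger affinity of a product measure is the product of the coordinatewise affinities (the mean vectors below are arbitrary; the check is mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_p, mu_q = rng.normal(size=20), rng.normal(size=20)   # arbitrary mean vectors

h2 = 2 * (1 - np.exp(-(mu_p - mu_q) ** 2 / 8))  # coordinatewise squared Hellinger
h2_prod = 2 * (1 - np.prod(1 - h2 / 2))         # exact H² between the two products
print(h2_prod, np.sum(h2), h2_prod <= np.sum(h2))   # Lemma 2.4 holds
```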

Consider a partition of [0, 1] into subintervals D_j. The Gaussian experiment E_{1,n}(f₀) has a convenient independence structure: in the representation (16), observations on the signal λ_{f,f₀}(t) + K(f₀, f) are independent on different pieces D_j. A corresponding approximate product structure for the i.i.d. experiment E_{0,n}(f₀) will be established by Poissonization. Let E_{0,j,n}(f₀) be the experiment given by observing interval censored observations

(18) y_i χ_{D_j}(y_i), where y_i, i = 1, …, n, are i.i.d. with density f,

with f ∈ Σ_n(f₀). We use the symbol ⊗ for products of experiments having the same parameter space.

Proposition 2.5. Let k_n be a sequence with k_n → ∞, and consider a partition D_j = [(j−1)/k_n, j/k_n), j = 1, …, k_n. Then

Δ( E_{0,n}(f₀), ⊗_{j=1}^{k_n} E_{0,j,n}(f₀) ) → 0,

uniformly over f₀ ∈ Σ. Our choice of k_n will be

(19) k_n ≍ n^{1/2} / log^{3/2} n.

For each D_j we form a local likelihood process Λ_{0,j,n}(f, f₀), as the likelihood process for observations in (18) for given j, and establish a Gaussian approximation like (17) with a rate. Let A_j = F₀(D_j) and let E_{1,j,n}(f₀) be the Gaussian experiment

(20) dy(t) = χ_{A_j}(t) ( λ_{f,f₀}(t) + K(f₀, f) ) dt + n^{-1/2} dW(t), t ∈ [0, 1],

with parameter space Σ_n(f₀). Let Λ_{1,j,n}(f, f₀) be the corresponding likelihood process.

Proposition 2.6. On the probability space (Ω, A, P) of Proposition 2.3, there are versions Λ̃_{i,j,n}(f, f₀), i = 0, 1, such that

(21) sup_{f∈Σ_n(f₀)} H²( Λ̃_{0,j,n}(f, f₀), Λ̃_{1,j,n}(f, f₀) ) = O( γ_n² (log n)³ ),

uniformly over j = 1, …, k_n and f₀ ∈ Σ.

This admits the following interpretation. Define m_n = n/k_n; in our setting this is the stochastic order of magnitude of the number of observations y_i falling into D_j. Thus for the local likelihood process Λ_{0,j,n}(f, f₀) the number m_n represents an effective sample size in a rate sense. In view of (6) and (19) we have γ_n ≍ m_n^{-1/2} (log n)^{-1/4}, and since this is the shrinking rate of Σ_n(f₀) in the uniform norm, it is also the shrinking rate of this set of densities restricted to D_j, and of the corresponding set of conditional densities. Thus in a sense we are almost in a classical setting with sample size m_n and a root-m_n shrinking neighborhood.
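The rate bookkeeping behind (6), (19) and (21) is worth seeing numerically: with γ_n = n^{-1/4}(log n)^{-1} and k_n = n^{1/2}(log n)^{-3/2}, the total bound k_n γ_n² (log n)³ appearing in the proof of Theorem 2.1 below is exactly (log n)^{-1/2}. A worked check (my own script):

```python
import numpy as np

for n in [1e4, 1e6, 1e8, 1e12]:
    ln = np.log(n)
    gamma_n = n**-0.25 / ln                  # (6)
    k_n = n**0.5 / ln**1.5                   # (19)
    m_n = n / k_n                            # effective sample size per piece
    bound = k_n * gamma_n**2 * ln**3         # k_n γ_n² (log n)³ from the proof
    print(f"n={n:.0e}  m_n={m_n:.3g}  bound={bound:.4f}  (log n)^-1/2={ln**-0.5:.4f}")
```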

Result (21) implies

(22) Δ( E_{0,j,n}(f₀), E_{1,j,n}(f₀) ) = O( m_n^{-1/2} (log n)^{5/4} );

that is, we have a root-m_n rate up to a log term. Note that here we have introduced a space-local aspect in addition to the already present parameter-local one. In piecing together these space-local approximations, we will crucially use the product measure estimate of Lemma 2.4. This motivates our choice to work with the Hellinger distance for the likelihood processes construed as densities.

Proof of Theorem 2.1. The Gaussian experiment E_{1,n}(f₀) decomposes exactly:

Δ( E_{1,n}(f₀), ⊗_{j=1}^{k_n} E_{1,j,n}(f₀) ) = 0.

According to (12) and Lemma 2.4 we have

Δ²( ⊗_{j=1}^{k_n} E_{0,j,n}(f₀), ⊗_{j=1}^{k_n} E_{1,j,n}(f₀) ) ≤ 4 sup_{f∈Σ_n(f₀)} ∑_{j=1}^{k_n} H²( Λ̃_{0,j,n}(f, f₀), Λ̃_{1,j,n}(f, f₀) ).

By Proposition 2.6 this is bounded by

O( k_n γ_n² (log n)³ ) = O( (log n)^{-1/2} ) = o(1),

and these estimates hold uniformly over f₀ ∈ Σ.

Low (1992) considered experiments given by local (on D_j) perturbations of a fixed density f₀ and applied a local asymptotic normality argument to obtain strong convergence to a Gaussian experiment. This amounts to having (22) without a rate, and it is already useful for a number of nonparametric decision problems, like estimating the density at a point. Golubev (1991) used a similar argument for treating estimation with L₂-loss.

We are now able to identify several more asymptotically equivalent models. This is based on the following reasoning, applied by Brown and Low (1996) to compare Gaussian white noise models. Consider the measures of the process n^{-1/2} W(t), t ∈ [0, 1], shifted by functions ∫₀ᵗ g_i, i = 1, 2, where g_i ∈ L₂(0, 1); call these measures P_i. Then

(23) H²(P₁, P₂) = 2 ( 1 − exp{ −(n/8) ‖g₁ − g₂‖₂² } ).

If (g_i(ϑ) : ϑ ∈ Θ), i = 1, 2, are two parametric families, then the respective experiments are asymptotically equivalent if ‖g₁(ϑ) − g₂(ϑ)‖₂ = o(n^{-1/2}) uniformly over ϑ ∈ Θ.
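Formula (23) is explicit enough to use directly; the sketch below (my own function and numbers) shows how drifts differing by o(n^{-1/2}) in L₂ merge asymptotically:

```python
import numpy as np

def hellinger2_shifted_wiener(g1, g2, t, n):
    """Squared Hellinger distance (23) between two shifted n^{-1/2}-Wiener laws."""
    dt = t[1] - t[0]
    sq = (g1 - g2) ** 2
    norm2 = float(np.sum((sq[1:] + sq[:-1]) / 2) * dt)   # ||g1 - g2||²
    return 2 * (1 - np.exp(-n / 8 * norm2))

t = np.linspace(0.0, 1.0, 10_001)
for n in [10**2, 10**4, 10**6]:
    g1 = np.sin(2 * np.pi * t)
    g2 = g1 + n**-0.75                       # perturbation of size o(n^{-1/2})
    print(n, hellinger2_shifted_wiener(g1, g2, t, n))    # tends to 0
```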

In the Gaussian experiment E_{1,n}(f₀) of (16), the shift is essentially a log-density ratio. We know that log(f/f₀) is small over f ∈ Σ_n(f₀); expanding the logarithm, we get asymptotically equivalent experiments with parameter space Σ_n(f₀). Accordingly, let E_{2,n}(f₀) be the experiment given by observations

(24) dy(t) = ( f(t) − f₀(t) ) dt + n^{-1/2} f₀^{1/2}(t) dW(t), t ∈ [0, 1],

with parameter space Σ_n(f₀), and let E_{3,n}(f₀) correspondingly be given by

(25) dy(t) = ( f^{1/2}(t) − f₀^{1/2}(t) ) dt + ½ n^{-1/2} dW(t), t ∈ [0, 1].

Theorem 2.7. The experiments E_{i,n}(f₀), i = 1, 2, 3, are asymptotically equivalent, uniformly over f₀ ∈ Σ.

Remark 2.8. The equivalence class of E_{2,n}(f₀) is not changed when the additive term f₀(t) dt in (24) is omitted, since this term does not depend on the parameter f, and omitting it amounts to a translation of the observed process y by a known quantity. Moreover, in the proof below it will be seen that in the representation (16) of E_{1,n}(f₀) the term K(f₀, f) dt is asymptotically negligible. Analogous statements are true for the other variants; hence locally asymptotically equivalent experiments for f ∈ Σ_n(f₀) (with uniformity over f₀ ∈ Σ) are also given by the following:

(26) y_i, i = 1, …, n, i.i.d. with density f;

(27) dy(t) = log f(F₀^{-1}(t)) dt + n^{-1/2} dW(t), t ∈ [0, 1];

(28) dy(t) = f(t) dt + n^{-1/2} f₀^{1/2}(t) dW(t), t ∈ [0, 1];

(29) dy(t) = f^{1/2}(t) dt + ½ n^{-1/2} dW(t), t ∈ [0, 1].

Note that (28) is related to the weak convergence of the empirical distribution function F̂_n,

n^{1/2} ( F̂_n − F ) ⇒ B(F).

Indeed, arguing heuristically, when F is in a shrinking neighborhood of F₀ we have B(F) ≈ B(F₀), while F̂_n is a sufficient statistic. We obtain

F̂_n ≈ F + n^{-1/2} B(F₀),

which suggests a Gaussian accompanying experiment (28). This reasoning is familiar as a heuristic introduction to limiting Gaussian shift experiments, when neighborhoods are shrinking with rate n^{-1/2}. However, our neighborhoods f ∈ Σ_n(f₀) are larger [recall γ_n = n^{-1/4} (log n)^{-1}].
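The Brownian-bridge heuristic above is easy to see in simulation (my own quick check, uniform case F = identity): the fluctuation of n^{1/2}(F̂_n(t₀) − t₀) has variance close to Var B(t₀) = t₀(1 − t₀).

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, t0 = 2000, 2000, 0.3                # F = identity (uniform case)

u = rng.uniform(size=(reps, n))
emp = np.sqrt(n) * (np.mean(u <= t0, axis=1) - t0)  # n^{1/2}(F_hat_n(t0) - t0)
print(emp.var(), t0 * (1 - t0))              # both ≈ Var B(t0) = t0(1 - t0)
```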

3. From local to global results. The local result concerning a shrinking neighborhood of some f₀ is of limited value for statistical inference, since usually such prior information is not available. Following Le Cam's general principles, we shall construct an experiment where the prior information is furnished by a preliminary estimator, and subsequently the local Gaussian approximation is built around the estimated parameter value. To formalize this approach, let N_n define a fraction of the sample size; that is, N_n is a sequence, N_n < n, and consider the corresponding fraction of the sample y₁, …, y_{N_n}. Then let f̂_n be an estimator of f based on this fraction, fulfilling (with P_{n,f} the pertinent measure)

(30) inf_{f∈Σ} P_{n,f}( f̂_n ∈ Σ_n(f) ) → 1.

The set Σ must be such that the shrinking rate of Σ_n(f) is an attainable rate for estimators. If f has a bounded derivative of order α, we have for f an attainable rate in sup norm (n/log n)^{-α/(2α+1)} [see Woodroofe (1967)]. The required sup norm rate is γ_n = o(n^{-1/4}); this corresponds to α > 1/2. Thus we may expect for the Hölder smoothness classes assumed here that the rate γ_n is attainable if the size N_n of the fraction is sufficiently large. We will allow for a range of choices:

(31) n/log n ≤ N_n ≤ n/2.

Define E_{0,n} to be the original i.i.d. experiment (3) with global parameter space Σ.

Lemma 3.1. Suppose (31) holds. Then in E_{0,n} there exists a sequence of estimators f̂_n depending only on y₁, …, y_{N_n} fulfilling (30). One may assume that for each n the estimator takes values in a finite subset of Σ.

The proof is in Section 8. The following construction of a global approximating experiment assumes such an estimator sequence fixed. The idea is to substitute f̂_n for f₀ in the local Gaussian approximation and to retain the first fraction of the i.i.d. sample; a schematic sketch of this sample splitting follows below. Recall that our local Gaussian approximations were given by families (Q_{n,f,f₀} : f ∈ Σ_n(f₀)) [cf. (10)]. Note that f ∈ Σ_n(f₀) is essentially the same as f₀ ∈ Σ_n(f). Accordingly we now consider the event f̂_n ∈ Σ_n(f) and let f range in the unrestricted parameter space Σ. We look at the second sample part, of size n − N_n, with its initial i.i.d. family (P_f^{n−N_n}, f ∈ Σ). Based on the results of the previous section, we can hope that this family will be close, in the experiment sense, to the conditionally Gaussian family (Q_{n−N_n,f,f̂_n}, f ∈ Σ), on the event f̂_n ∈ Σ_n(f). The measures Q_{n,f,f̂_n}, which now depend on f̂_n, have to be interpreted as conditional measures, and we form a joint distribution with the first sample fraction. This idea is especially appealing when the locally approximating Gaussian measure Q_{n,f,f₀} does not depend on the center f₀. In this case the resulting global experiment will have a convenient product structure, as we shall see.
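A schematic sketch of the sample splitting (assumptions mine: the paper's preliminary estimator comes from exact-constant theory, Section 8; here a plain kernel estimator, a Beta sample and an α = 1 bandwidth stand in as illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
N = n // 2                                   # the choice N_n = n/2 used below
y = rng.beta(2.0, 2.0, size=n)               # stand-in i.i.d. sample on [0,1]

grid = np.linspace(0.0, 1.0, 401)
h = (np.log(N) / N) ** (1 / 3)               # sup-norm bandwidth for alpha = 1
kern = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)  # Epanechnikov
f_hat = kern((grid[:, None] - y[None, :N]) / h).mean(axis=1) / h

# f_hat now centers the local Gaussian approximation for the second half
# y_{N+1}, ..., y_n, as in (32)-(33) below.
```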

Such independence of the center occurs with the variant (29) in Remark 2.8, when we parametrize with f^{1/2}. To be more precise, define Q_{i,n,f,f₀}, i = 1, 2, 3, to be the distributions of (y(t), t ∈ [0, 1]) in (27)-(29). Consider a compound experiment given by joint observations y₁, …, y_{N_n} and y = (y(t), t ∈ [0, 1]), where

(32) y₁, …, y_{N_n} are i.i.d. with density f,

(33) L( y | y₁, …, y_{N_n} ) = Q_{i,n−N_n,f,f̂_n}.

Here (33) describes the conditional distribution of y given y₁, …, y_{N_n}. Define R_{i,n,f,f̂} to be the joint distribution of y₁, …, y_{N_n} and y in this setup, for i = 1, 2, 3; the notation signifies dependence on the sequence of decision functions f̂ = (f̂_n)_{n≥1} (not dependence on the estimator value). Then the compound experiment is

E_{i,n}(f̂) = ( [0,1]^{N_n} × C[0,1], B^{N_n}[0,1] ⊗ B(C[0,1]), ( R_{i,n,f,f̂} : f ∈ Σ ) ).

Since Q_{3,n,f,f₀} = Q_{3,n,f} does not depend on f₀, the measure R_{3,n,f,f̂} = R_{3,n,f} does not depend on f̂ either and is just the product measure P_f^{N_n} ⊗ Q_{3,n−N_n,f}. We also write E_{3,n}(f̂) = E_{3,n}. The technical implementation of the above heuristic reasoning (see Section 9) gives the following result.

Theorem 3.2. Suppose (31) holds and let f̂_n be a sequence of estimators as in Lemma 3.1. Then, for i = 1, 2, 3,

Δ( E_{0,n}, E_{i,n}(f̂) ) → 0.

To restate this in a more transparent fashion, we refer to y₁, …, y_{N_n} and y = (y(t), t ∈ [0, 1]) in (32) and (33) as the first and second parts of the compound experiment, respectively. Let F̂_n be the distribution function corresponding to the realized density estimator f̂_n.

Corollary 3.3. Under the conditions of Theorem 3.2, the compound experiments with first part

(34) y_i, i = 1, …, N_n, i.i.d. with density f,

and respective second parts

(35) y_i, i = N_n + 1, …, n, i.i.d. with density f;

(36) dy(t) = log f(F̂_n^{-1}(t)) dt + (n − N_n)^{-1/2} dW(t), t ∈ [0, 1];

(37) dy(t) = f(t) dt + (n − N_n)^{-1/2} f̂_n^{1/2}(t) dW(t), t ∈ [0, 1];

(38) dy(t) = f^{1/2}(t) dt + ½ (n − N_n)^{-1/2} dW(t), t ∈ [0, 1],

with f ∈ Σ are all asymptotically equivalent.

For obtaining a closed-form global approximation, the compound experiment E_{3,n} [i.e., (34), (38)] is the most interesting one, in view of its product structure and independence of f̂. Here the estimator sequence f̂ only serves to show asymptotic equivalence to E_{0,n}; it does not show up in the target experiment E_{3,n} itself. This structure of E_{3,n} suggests employing an estimator based on the second part for a next step.

Lemma 3.4. Suppose (31) holds. Then in E_{3,n} there exists a sequence of estimators f̌_n depending only on y in (38) fulfilling (30). The second statement of Lemma 3.1 also applies.

Note the similarity to Lemma 3.1. Here we exploit the well-known parallelism of density estimation and white noise on the rate of convergence level.

Proof of Theorem 1.1. We choose N_n = n/2. On the resulting compound experiment E_{3,n} we may then operate again, reversing the roles of first and second part. We may in turn substitute y₁, …, y_{N_n} by a white noise model, using a preliminary estimator based on (38). The existence of such an estimator is guaranteed by the previous lemma. Thus substituting y₁, …, y_{N_n} by white noise leads to an experiment with joint observations

dy₁(t) = f^{1/2}(t) dt + ½ N_n^{-1/2} dW₁(t), t ∈ [0, 1],

dy₂(t) = f^{1/2}(t) dt + ½ (n − N_n)^{-1/2} dW₂(t), t ∈ [0, 1],

where W₁ and W₂ are independent Wiener processes. A sufficiency argument shows this is equivalent to observing n i.i.d. processes, each distributed as

dy(t) = f^{1/2}(t) dt + ½ dW(t), t ∈ [0, 1],

which in turn is equivalent to (4).
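The sufficiency step above reduces to the standard fact that two independent observations of the same drift with noise levels a and b are equivalent to one observation at level (a^{-2} + b^{-2})^{-1/2}: precisions add, and the weighted average is sufficient. A one-line check with my own numbers:

```python
import numpy as np

n = 20_000
N = n // 2
a = 0.5 / np.sqrt(N)                         # noise level (1/2) N_n^{-1/2}
b = 0.5 / np.sqrt(n - N)                     # noise level (1/2) (n - N_n)^{-1/2}

combined = (a**-2 + b**-2) ** -0.5           # precision (inverse variance) adds
print(combined, 0.5 / np.sqrt(n))            # both equal (1/2) n^{-1/2}
```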

4. Poissonization and product structure. For the proof of Proposition 2.5 we need some basic concepts from the theory of point processes [see Reiss (1993)]. A point measure on R is a measure µ ≥ 0 of the form µ = ∑_{i∈I} µ_{x_i}, where I ⊂ N, the x_i are points in R and µ_x is Dirac measure at x. A point process is a random variable on a probability space (Ω, A, P) with values in the space of point measures M, equipped with the appropriate σ-algebra M [see Reiss (1993), page 6]. If Y = (y_i, i = 1, 2, …) is a sequence of i.i.d. r.v.'s, then the random measure µ_{0,n} = ∑_{i=1}^n µ_{y_i} is called an empirical point process. More generally, if ν is a random natural number independent of Y, then µ = ∑_{i=1}^ν µ_{y_i} is a mixed empirical point process. In particular, if ν = π_n is Poisson(n), then µ_n = ∑_{i=1}^{π_n} µ_{y_i} is a Poisson process which has intensity function nf if y₁ has density f. If f and f₀ are two densities for y₁ such that P_f ≪ P_{f₀}, and the law of ν is given, then it is possible to write down densities for the distributions Π_f = L(µ | P_f) of the mixed empirical point process µ. For the case of the empirical and the Poisson point process (ν = n or ν = π_n) we shall denote these distributions, respectively, by Π_{0,n,f} and Π_{n,f}.

For observations ν, y_i, i = 1, …, ν, write the likelihood ratio for hypotheses (P_f, L(ν)) versus (P_{f₀}, L(ν)),

(39) ∏_{i=1}^ν (f/f₀)(y_i) = exp{ ∫ log(f/f₀) dµ }.

This is a function of µ which can be construed as a density of the point process law Π_f on M with respect to Π_{f₀}, or as a likelihood process when f varies. Note that for different ν these densities are defined on different probability spaces, since the respective laws Π_{f₀} differ. However, let (Ω, A, P) be a probability space carrying the i.i.d. sequence Y and ν as independent r.v.'s. Then (39) also describes versions on the probability space (Ω, A, P), which is common for different ν. For the case of the empirical and the Poisson point process (ν = n or ν = π_n) we shall denote these likelihood processes, respectively, by Λ_{0,n}(f, f₀)(ω) and Λ_{π,n}(f, f₀)(ω). The experiments defined by these versions construed as P-densities are then equivalent to the respective point process experiments, for any parameter space. In particular, the empirical point process experiment (with laws Π_{0,n,f}) is equivalent to the original i.i.d. experiment with n observations; µ_{0,n} = ∑_{i=1}^n µ_{y_i} is a sufficient statistic. For our particular parameter space Σ_n(f₀) define the Poisson process experiment

E_{π,n}(f₀) = ( M, M, ( Π_{n,f} : f ∈ Σ_n(f₀) ) ),

and recall the definition (8) of the i.i.d. experiment E_{0,n}(f₀).

Proposition 4.1. We have

Δ( E_{0,n}(f₀), E_{π,n}(f₀) ) → 0,

uniformly over f₀ ∈ Σ.

Proof. We use an argument adapted from Le Cam (1985). It suffices to establish that

H²( Λ̃_{0,n}(f, f₀), Λ̃_{π,n}(f, f₀) ) = O( n^{1/2} γ_n² )

uniformly over f ∈ Σ_n(f₀), f₀ ∈ Σ. With ν_min = min(π_n, n) and ν_max = max(π_n, n) we get

H²( Λ̃_{0,n}(f, f₀), Λ̃_{π,n}(f, f₀) ) = E_P ( ∏_{i=1}^n (f/f₀)^{1/2}(y_i) − ∏_{i=1}^{π_n} (f/f₀)^{1/2}(y_i) )²

= E_P ∏_{i=1}^{ν_min} (f/f₀)(y_i) ( ∏_{i=ν_min+1}^{ν_max} (f/f₀)^{1/2}(y_i) − 1 )².

Consider first the conditional expectation when π_n is given; since the y_i are independent, it is

E_P ( ( ∏_{i=ν_min+1}^{ν_max} (f/f₀)^{1/2}(y_i) − 1 )² | π_n ).

This can be construed as the squared Hellinger distance of two product densities, one of which has ν_max − ν_min = |π_n − n| factors and the other has as many factors equal to unity. Applying Lemma 2.4, we get an upper bound

∑_{i=ν_min+1}^{ν_max} E_P ( ( (f/f₀)^{1/2}(y_i) − 1 )² | π_n ) ≤ |π_n − n| γ_n².

Taking an expectation and observing E|π_n − n| ≤ Cn^{1/2} completes the proof.

If µ is a point process and D a measurable set, then define the truncated point process

µ^D(B) = µ(B ∩ D), B ∈ B.

Let µ_{0,n}^D and µ_n^D be truncated empirical and Poisson point processes on [0, 1], respectively. The following Hellinger distance estimate is due to Falk and Reiss (1992); see also Reiss [(1993), Theorem 1.4.2]:

(40) H²( L(µ_{0,n}^D | P_f), L(µ_n^D | P_f) ) ≤ 3 P_f(D)².

Proof of Proposition 2.5. By the previous proposition it suffices to establish that

(41) Δ( E_{π,n}(f₀), ⊗_{j=1}^{k_n} E_{0,j,n}(f₀) ) → 0,

uniformly over f₀ ∈ Σ. In E_{0,j,n}(f₀) we observe n i.i.d. truncated random variables (18); their empirical point process is a sufficient statistic. Hence µ_{0,n}^{D_j} (the truncated empirical point process for the original y_i) is a sufficient statistic also; let Π_{0,j,n,f} = L(µ_{0,n}^{D_j} | P_f) be the corresponding law. It follows that each E_{0,j,n}(f₀) is equivalent to an experiment

Ẽ_{0,j,n}(f₀) = ( M, M, ( Π_{0,j,n,f} : f ∈ Σ_n(f₀) ) ).

Let Π_{j,n,f} = L(µ_n^{D_j} | P_f) be the law of the truncated Poisson point process and

Ẽ_{j,n}(f₀) = ( M, M, ( Π_{j,n,f} : f ∈ Σ_n(f₀) ) ).

Then, by the properties of the Poisson process, E_{π,n}(f₀) is equivalent to ⊗_{j=1}^{k_n} Ẽ_{j,n}(f₀). It now suffices to show that

Δ( ⊗_{j=1}^{k_n} Ẽ_{j,n}(f₀), ⊗_{j=1}^{k_n} Ẽ_{0,j,n}(f₀) ) → 0,

uniformly over f₀ ∈ Σ. From Lemma 2.4 and (40) we obtain

H²( ⊗_{j=1}^{k_n} Π_{j,n,f}, ⊗_{j=1}^{k_n} Π_{0,j,n,f} ) ≤ ∑_{j=1}^{k_n} H²( Π_{j,n,f}, Π_{0,j,n,f} ) ≤ 3 ∑_{j=1}^{k_n} P_f(D_j)² ≤ 3 sup_{1≤j≤k_n} P_f(D_j) ∑_{j=1}^{k_n} P_f(D_j).

The functions f ∈ Σ are uniformly bounded, in view of the uniform Hölder condition and ∫f = 1. Hence P_f(D_j) → 0 uniformly in f ∈ Σ and j.
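The step E|π_n − n| ≤ Cn^{1/2} in the proof of Proposition 4.1 is a CLT-scale statement about a Poisson(n) sample size; a Monte Carlo check (mine) makes the constant visible:

```python
import numpy as np

rng = np.random.default_rng(6)
for n in [100, 10_000, 1_000_000]:
    pi_n = rng.poisson(n, size=100_000)      # Poisson(n) sample sizes
    print(n, np.mean(np.abs(pi_n - n)) / np.sqrt(n))   # ≈ sqrt(2/pi) ≈ 0.798
```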

ASYMPTOTIC EQUIVALENCE 415 uniformly over f 0. From Lemma.4 and (40) we obtain ( kn ) H j n f 0 j n f j=1 k n j=1 k n j=1 H j n f 0 j n f 6 6 sup 1 j k n P f D j k n j=1 P f D j The functions f are uniformly bounded, in view of the uniform Hölder condition and f = 1. Hence P f D j 0 uniformly in f and j. 5. Empirical processes and function classes. From the point process framework we now return to the traditional notion of the empirical process as a normalized and centered random function. However, we consider processes indexed by functions. Let z i i = 1 n, be i.i.d. uniform random variables on 0 1. Then U n f = n 1/ ( n 1 n i=1 ) f z i f f 0 1 is the uniform empirical process. The corresponding Brownian bridge process is defined as a centered Gaussian random function B f f 0 1, with covariance ( ) EB f B g = fg f)( g f g 0 1 For any natural i, consider the subspace of 0 1 consisting of piecewise constant functions on 0 1 for a partition [ j 1 i, j i, j = 1 i. Let g i be the projection of a function g onto that subspace, and define, for natural K, ( K q K g = ) 1/ i g g i i=0 The following version of a KMT inequality is due to Koltchinskii [(1994), Theorem 3.5] (specialized to a single element function class there and to K = log n). Proposition 5.1. There are a probability space (,, P) and numbers C 1, C such that for all n there are versions U n and B n of the empirical process and of the Brownian bridge such that, for all g 0 1 with g 1 and for all x, y 0, (4) P ( n 1/ U n g B n g x + x 1/ y 1/ q log n g + 1 ) C 1 exp C x + n exp C y

To set q_K(g) in relation to a smoothness measure, consider functions g ∈ L₂(0, 1) satisfying, for some C,

(43) h^{-1} ∫₀^{1−h} ( g(u + h) − g(u) )² du ≤ C² for all h > 0.

For a given g, define ‖g‖_{H^{1/2}} as the infimum of all numbers C such that (43) holds; it is easy to see that ‖·‖_{H^{1/2}} is a seminorm. The corresponding space H^{1/2} with norm ‖·‖₂ + ‖·‖_{H^{1/2}} coincides with the Besov space B^{1/2}_{2,∞} on (0, 1) [see Nikolskij (1975), 4.3.3, 6.2]. Furthermore [cf. Koltchinskii (1994), relation (4.5)],

q_K(g) ≤ 4 K^{1/2} ‖g‖_{H^{1/2}}.

Proof of Proposition 2.3. If g fulfills ‖g‖_∞ < ∞, we divide by ‖g‖_∞ and apply (42); furthermore, we set y = x + C₂^{-1} log n, x = C₂^{-1} t and obtain, from (42),

C₁ exp(−t) ≥ P( n^{1/2} |U_n(g) − B_n(g)| ≥ ‖g‖_∞ x + x^{1/2} y^{1/2} ( q_{log n}(g) + ‖g‖_∞ ) )

≥ P( n^{1/2} |U_n(g) − B_n(g)| ≥ ‖g‖_∞ x + x^{1/2} y^{1/2} ( 4 ‖g‖_{H^{1/2}} log^{1/2} n + ‖g‖_∞ ) )

≥ P( n^{1/2} |U_n(g) − B_n(g)| ≥ C ( ‖g‖_∞ + ‖g‖_{H^{1/2}} ) ( t + log n ) log^{1/2} n ).

Lemma 5.2. There is a C such that, for all f ∈ Σ_n(f₀), f₀ ∈ Σ,

‖λ_{f,f₀}‖_∞ ≤ Cγ_n, λ_{f,f₀} ∈ Σ(α, C).

Proof. The first relation is obvious. For the second, note that F₀^{-1} has derivative 1/f₀(F₀^{-1}), and since f₀ ≥ ε, we have F₀^{-1} ∈ Σ(1, C). Now write λ_{f,f₀} as a difference of logarithms and again invoke f, f₀ ≥ ε.

Next we have to consider the likelihood ratio for interval censored observations (18). We shall do this for a generic interval D ⊂ [0, 1] of length k_n^{-1}. We wish to represent the observations via the quantile function F₀^{-1} in the usual fashion; we therefore assume D = F₀^{-1}(A), where A ⊂ [0, 1]. Consider a class of intervals, for given C₁, C₂ > 0,

(44) A_n = { A : A = [a₁, a₂) ⊂ [0, 1], C₁ ≤ k_n mes A ≤ C₂ }.

The assumption f₀ ∈ Σ implies that f₀ is uniformly bounded and bounded away from zero. Hence mes D = k_n^{-1} implies that A = F₀(D) ∈ A_n for all f₀ ∈ Σ and appropriately chosen C₁ and C₂. The technical development will now be carried out uniformly over all intervals A ∈ A_n. We shall set P_f(F₀^{-1}(A)) = p, P_{f₀}(F₀^{-1}(A)) = p₀. The corresponding log-likelihood ratio under f₀, expressed as a function of a uniform(0, 1) variable z, is then λ_{f,f₀,A}(z), where

(45) λ_{f,f₀,A}(t) = χ_A(t) log(f/f₀)(F₀^{-1}(t)) + ( 1 − χ_A(t) ) log( (1 − p)/(1 − p₀) ).

Since λ_{f,f₀,A} has jumps at the endpoints of A, it is not in a Hölder class Σ(α, M); but it is in an L₂-Hölder class, so that we can ultimately estimate ‖λ_{f,f₀,A}‖_{H^{1/2}} and apply the KMT inequality of Proposition 2.3. We first need some technical lemmas.

Lemma 5.3. There is a C such that, for all f ∈ Σ_n(f₀), f₀ ∈ Σ, A ∈ A_n,

sup_{t∈A} |λ_{f,f₀,A}(t)| ≤ Cγ_n, sup_{t∈A^c} |λ_{f,f₀,A}(t)| ≤ Ck_n^{-1} γ_n.

Proof. For t ∈ A we invoke the previous lemma. For t ∈ A^c we estimate

|p − p₀| = | ∫_D (f − f₀) | ≤ ∫_D f₀ |f/f₀ − 1| ≤ γ_n ∫_D f₀ = γ_n p₀.

In view of (44) we also have p₀ ≤ Ck_n^{-1} ≤ 1/2; hence

(46) | 1 − (1 − p)/(1 − p₀) | = |p₀ − p| / (1 − p₀) ≤ 2 |p − p₀| ≤ Ck_n^{-1} γ_n.

This implies a similar estimate for log( (1 − p)/(1 − p₀) ) and thus yields the estimate for t ∈ A^c.

Lemma 5.4. There is a constant C such that, for all f ∈ Σ_n(f₀), f₀ ∈ Σ, A ∈ A_n,

‖λ_{f,f₀,A}‖₂² ≤ Cn^{-1} log^{-1/2} n, | ∫₀¹ λ_{f,f₀,A} | ≤ Cn^{-1} log^{-1/2} n.

Proof. From the previous lemma and (44) we obtain

(47) ‖λ_{f,f₀,A}‖₂² = ‖χ_A λ_{f,f₀,A}‖₂² + ‖χ_{A^c} λ_{f,f₀,A}‖₂² ≤ Ck_n^{-1} γ_n²;

hence, in view of (6) and (19),

n ‖λ_{f,f₀,A}‖₂² ≤ Cnk_n^{-1} γ_n² ≤ C log^{-1/2} n.

To prove the second relation, define ϕ(t) = exp( λ_{f,f₀,A}(t) ); then ∫₀¹ ϕ = 1, and Lemma 5.3 implies |ϕ(t) − 1| ≤ Cγ_n uniformly. Hence

n | ∫₀¹ λ_{f,f₀,A} | = n | ∫₀¹ log ϕ | ≤ n | ∫₀¹ (ϕ − 1) | + Cn ∫₀¹ (ϕ − 1)² = Cn ∫₀¹ (ϕ − 1)².

Here we treat the r.h.s. analogously to (47), using the fact that Lemma 5.3 remains true with ϕ − 1 in place of λ, so that

(48) n ∫₀¹ (ϕ − 1)² ≤ C log^{-1/2} n.

Lemma 5.5. There is a C such that, for all f ∈ Σ_n(f₀), f₀ ∈ Σ, A ∈ A_n,

‖λ_{f,f₀,A}‖_{H^{1/2}} ≤ Cγ_n.

Proof. It suffices to show

(49) ∫₀^{1−h} ( λ_{f,f₀,A}(x + h) − λ_{f,f₀,A}(x) )² dx ≤ Cγ_n² h for 0 < h < 1/2.

Let A = [a₁, a₂) and define A₁(h) = [a₁ + h, a₂ − h) and A₂(h) = [a₁ − h, a₂ + h) (here A₁(h) is empty for h > C₂ k_n^{-1}/2). The integral in (49) will be split into integrals over A₁(h), A₂(h)\A₁(h) and [0, 1 − h)\A₂(h). According to Lemma 5.2, λ_{f,f₀,A} fulfills a Hölder condition on A and is bounded there by Cγ_n, so that

∫_{A₁(h)} ( λ_{f,f₀,A}(x + h) − λ_{f,f₀,A}(x) )² dx ≤ C min( h^{2α}, γ_n² ) k_n^{-1}.

We have k_n^{-1} = γ_n² log^{7/2} n in view of (6) and (19), so that for the above we obtain a bound Chγ_n² { min( h^{2α−1}, h^{-1} γ_n² ) log^{7/2} n }. Since α > 1/2, the expression in the curly brackets tends to zero uniformly in 0 < h < 1/2, as can be seen by distinguishing the cases h ≤ γ_n and h ≥ γ_n. For the second integral, we use the estimate |λ_{f,f₀,A}| ≤ Cγ_n implied by Lemma 5.3 and obtain

∫_{A₂(h)\A₁(h)} ( λ_{f,f₀,A}(x + h) − λ_{f,f₀,A}(x) )² dx ≤ Cγ_n² h.

Finally, note that λ_{f,f₀,A} is constant on [0, 1]\A, so that

∫_{[0,1−h)\A₂(h)} ( λ_{f,f₀,A}(x + h) − λ_{f,f₀,A}(x) )² dx = 0.

Thus (49) is established.

6. The local likelihood processes. Consider now the likelihood process for n observations (18) when D_j is replaced by the generic subinterval D = F₀^{-1}(A) with A ∈ A_n from (44). With n i.i.d. uniform(0, 1) variables z_i we get an expression for the likelihood process

(50) Λ_{0,n}(f, f₀, A) = exp{ ∑_{i=1}^n λ_{f,f₀,A}(z_i) };

for A = F₀(D_j) this is the same as Λ_{0,j,n}(f, f₀) as defined after (19). Let

K(f₀, f, A) = −∫₀¹ λ_{f,f₀,A}(t) dt

denote the pertaining Kullback information number. We assume that U_n and B_n are sequences of uniform empirical processes and Brownian bridges which both come from the Hungarian construction of Proposition 2.3. We obtain the representation [cf. (14) and Proposition 2.6, suppressing the notational distinction of versions]

(51) Λ_{0,n}(f, f₀, A) = exp{ n^{1/2} U_n( λ_{f,f₀,A} ) − nK(f₀, f, A) }.

The corresponding Gaussian likelihood ratio is [cf. (15)]

(52) Λ_{1,n}(f, f₀, A) = exp{ n^{1/2} B_n( λ_{f,f₀,A} ) − (n/2) Var λ_{f,f₀,A}(Z) }.

Consider also an intermediary expression,

Λ#_n(f, f₀, A) = exp{ n^{1/2} B_n( λ_{f,f₀,A} ) − nK(f₀, f, A) }.

The expression Λ#_n(f, f₀, A) is not normalized to expectation 1, but we consider it as the density of a positive measure on the probability space (Ω, A, P). The Hellinger distance H is then naturally extended to these positive measures.

Lemma 6.1. There is a C such that, for all f ∈ Σ_n(f₀), f₀ ∈ Σ, A ∈ A_n,

E_P ( Λ_{i,n}(f, f₀, A) )² ≤ C, i = 0, 1, E_P ( Λ#_n(f, f₀, A) )² ≤ C.

Proof. Define [for a uniform(0, 1) variable Z]

(53) T₁₀ = nK(f₀, f, A), T₁₁ = (n/2) Var λ_{f,f₀,A}(Z),

(54) T₀ = n^{1/2} U_n( λ_{f,f₀,A} ), T₁ = n^{1/2} B_n( λ_{f,f₀,A} ).

Since T₁ is a zero-mean Gaussian r.v. with variance 2T₁₁, we have

E_P exp(2T₁) = exp(4T₁₁).

Hence

E_P ( Λ_{1,n} )² = E_P exp( 2T₁ − 2T₁₁ ) = exp( 2T₁₁ ) ≤ exp( n ‖λ_{f,f₀,A}‖₂² ).

Now from Lemma 5.4 we obtain the assertion for i = 1. For the case i = 0, we get, from (50),

E_P ( Λ_{0,n} )² = E_P exp{ 2 ∑_{i=1}^n λ_{f,f₀,A}(z_i) } = ( E exp( 2λ_{f,f₀,A}(Z) ) )^n.

Now we have, for ϕ(t) = exp( λ_{f,f₀,A}(t) ),

E exp( 2λ_{f,f₀,A}(Z) ) = ∫₀¹ ϕ²(t) dt = 1 + ∫₀¹ ( ϕ(t) − 1 )² dt ≤ 1 + Cn^{-1},

as a consequence of ∫ϕ = 1 and (48). Hence

E_P ( Λ_{0,n} )² ≤ ( 1 + Cn^{-1} )^n ≤ exp C,

so that the lemma is established for i = 0. Finally, to treat E_P ( Λ#_n )², observe that Lemma 5.4 implies that T₁₀ and T₁₁ are uniformly bounded. Hence

E_P ( Λ#_n )² = E_P ( Λ_{1,n} )² exp( 2T₁₁ − 2T₁₀ ) ≤ C.

The next lemma is the key technical step, bringing in the Hungarian construction estimate of Proposition 2.3.

Lemma 6.2. There is a C such that, for all f ∈ Σ_n(f₀), f₀ ∈ Σ, A ∈ A_n,

H( Λ_{0,n}(f, f₀, A), Λ#_n(f, f₀, A) ) ≤ Cγ_n log^{3/2} n.

Proof. Define

T₂₀ = n^{1/2} ( B_n − U_n )( λ_{f,f₀,A} ).

Combining Proposition 2.3 with Lemmas 5.3 and 5.5, we obtain

P( |T₂₀| ≥ Cγ_n ( t + log n ) log^{1/2} n ) ≤ C exp(−t).

Set t = t_n = 4 log n and, for the above C,

u_n = 5Cγ_n log^{3/2} n.

For the event B = B(f, f₀, A) = { ω : |T₂₀| ≤ u_n } we obtain an estimate

(55) P(B^c) ≤ Cn^{-4}.

To treat H²( Λ_{0,n}, Λ#_n ), split the expectation there into E_P χ_B(·) and E_P χ_{B^c}(·), and observe

E_P χ_{B^c} ( Λ_{0,n}^{1/2} − Λ#_n^{1/2} )² ≤ E_P χ_{B^c} ( Λ_{0,n} + Λ#_n ) ≤ ( P(B^c) )^{1/2} ( E_P ( Λ_{0,n} + Λ#_n )² )^{1/2}.

According to the previous lemma, E_P ( Λ_{0,n} + Λ#_n )² is uniformly bounded, so that (55) implies

(56) E_P χ_{B^c} ( Λ_{0,n}^{1/2} − Λ#_n^{1/2} )² ≤ Cn^{-2}.

For the other part, observe that on ω ∈ B, in view of u_n = o(1),

| 1 − exp( T₂₀/2 ) | ≤ Cu_n,

so that, on ω ∈ B,

( Λ_{0,n}^{1/2} − Λ#_n^{1/2} )² = ( 1 − exp( T₂₀/2 ) )² Λ_{0,n} ≤ Cu_n² Λ_{0,n}.

Since E_P Λ_{0,n} = 1, we obtain

E_P χ_B ( Λ_{0,n}^{1/2} − Λ#_n^{1/2} )² ≤ Cu_n².

This completes the proof in view of (56) and n^{-2} = o(u_n²).

Lemma 6.3. For all f ∈ Σ_n(f₀), f₀ ∈ Σ, A ∈ A_n,

H( Λ_{0,n}(f, f₀, A), Λ_{1,n}(f, f₀, A) ) ≤ 2 H( Λ_{0,n}(f, f₀, A), Λ#_n(f, f₀, A) ).

Proof. Consider the space of random variables L₂(Ω, A, P) and note that H( Λ#_n, Λ_{1,n} ) is the distance of ( Λ#_n )^{1/2} and ( Λ_{1,n} )^{1/2} in that space. Furthermore,

( Λ_{1,n} )^{1/2} = ( Λ#_n )^{1/2} / ( E_P Λ#_n )^{1/2},

and ( Λ_{1,n} )^{1/2} is the element of the unit sphere of L₂(P) closest to ( Λ#_n )^{1/2}. Since ( Λ_{0,n} )^{1/2} is on the unit sphere, we have

H( Λ#_n, Λ_{1,n} ) ≤ H( Λ#_n, Λ_{0,n} ),

and therefore

H( Λ_{0,n}, Λ_{1,n} ) ≤ H( Λ_{0,n}, Λ#_n ) + H( Λ#_n, Λ_{1,n} ) ≤ 2 H( Λ_{0,n}, Λ#_n ).

Now let A = A_j = F₀(D_j) and consider also the likelihood process Λ_{1,j,n}(f, f₀) of the Gaussian experiment E_{1,j,n}(f₀) of (20). Notice that this differs from Λ_{1,n}(f, f₀, A_j) [cf. (52) and (45)]. We consider versions of both likelihood processes which are functions of the Brownian bridge version B_n.

Lemma 6.4. There is a C such that, for all f ∈ Σ_n(f₀), f₀ ∈ Σ and j = 1, …, k_n,

H( Λ_{1,n}(f, f₀, A_j), Λ_{1,j,n}(f, f₀) ) ≤ Cγ_n log^{3/2} n.

Proof. The likelihood process Λ_{1,n}(f, f₀, A_j) is Λ_{1,n}(f, f₀) from (15) with λ_{f,f₀} replaced by λ_{f,f₀,A_j}, so it corresponds to a Gaussian model

dy(t) = ( λ_{f,f₀,A_j}(t) + K(f₀, f, A_j) ) dt + n^{-1/2} dW(t), t ∈ [0, 1],

with f ∈ Σ_n(f₀) [cf. (16)]. Moreover, Λ_{1,j,n}(f, f₀) corresponds to the Gaussian model (20). Hence the distance H between the likelihood processes on (Ω, A, P) equals the Hellinger distance between the two respective shifted Wiener measures. We may apply (23), setting

g₁ = λ_{f,f₀,A_j} + K(f₀, f, A_j), g₂ = χ_{A_j} ( λ_{f,f₀} + K(f₀, f) ).

To estimate ‖g₁ − g₂‖₂, we note that, in view of (13) and (45),

g₁ − g₂ = χ_{A_j^c} λ_{f,f₀,A_j} + g̃, where g̃ = K(f₀, f, A_j) − χ_{A_j} K(f₀, f).

We claim that

(57) ‖g̃‖₂² ≤ Cn^{-1} γ_n²

uniformly. Indeed, using the expansion

(58) log x = log( 1 + (x − 1) ) = (x − 1) − ½ (x − 1)² + o( (x − 1)² )

and setting x = (f/f₀)(F₀^{-1}(t)) and λ̄_{f,f₀}(t) = (f/f₀)(F₀^{-1}(t)) − 1, we note that, for f ∈ Σ_n(f₀),

(59) λ_{f,f₀}(t) = λ̄_{f,f₀}(t) + O(γ_n²)

uniformly. Since ∫₀¹ λ̄_{f,f₀} = 0, we obtain

(60) −∫₀¹ λ_{f,f₀} = K(f₀, f) = ∫₀¹ ( λ̄_{f,f₀} − λ_{f,f₀} ) = O(γ_n²).

Furthermore, according to Lemma 5.4,

( ∫₀¹ λ_{f,f₀,A_j} )² ≤ Cn^{-2} = Cn^{-1} γ_n⁴ log⁴ n.

In conjunction with (60) this implies

‖g̃‖₂² ≤ C ( n^{-1} γ_n⁴ log⁴ n + mes(A_j) γ_n⁴ ),

where mes A_j = p₀ = P_{f₀}(D_j). Using p₀ ≤ Ck_n^{-1} and k_n^{-1} γ_n⁴ = n^{-1} log^{-1/2} n γ_n², we obtain (57). Furthermore,

‖g₁ − g₂ − g̃‖₂² = ‖χ_{A_j^c} λ_{f,f₀,A_j}‖₂² = ( 1 − p₀ ) log²( (1 − p)/(1 − p₀) ),

where p = P_f(D_j). Using (46) we find

(61) ‖g₁ − g₂ − g̃‖₂² ≤ Ck_n^{-2} γ_n² = Cn^{-1} γ_n² log³ n.

By (23) the squared Hellinger distance is

2 ( 1 − exp{ −(n/8) ‖g₁ − g₂‖₂² } ) ≤ (n/4) ‖g₁ − g₂‖₂²,

and the lemma follows from (57) and (61).

Proof of Proposition 2.6. Consider Λ_{0,n}(f, f₀, A) for A = A_j and identify this with Λ_{0,j,n}(f, f₀). Identify the version of Λ_{1,j,n}(f, f₀) with that of Lemma 6.4. The result then follows from Lemmas 6.2-6.4.

7. Further local approximations. Define functions

λ_{1,f,f₀} = λ_{f,f₀} + K(f₀, f),
λ_{2,f,f₀} = ( f/f₀ − 1 )(F₀^{-1}),
λ_{3,f,f₀} = 2( (f/f₀)^{1/2} − 1 )(F₀^{-1}),

and experiments E#_{i,n}(f₀) given by observations

(62) dy(t) = λ_{i,f,f₀}(t) dt + n^{-1/2} dW(t), t ∈ [0, 1],

and parameter space f ∈ Σ_n(f₀), for i = 1, 2, 3. We have seen that E#_{1,n}(f₀) = E_{1,n}(f₀) [cf. (16)].

Lemma 7.1. We have

Δ( E#_{i,n}(f₀), E_{i,n}(f₀) ) = 0, i = 1, 2, 3.

Proof. The likelihood process for E#_{i,n}(f₀) is

Λ#_{i,n}(f, f₀) = exp{ n^{1/2} ∫₀¹ λ_{i,f,f₀} dW − (n/2) ∫₀¹ λ_{i,f,f₀}² }, i = 1, 2, 3.

Define a process

W̄(t) = ∫₀ᵗ f₀^{-1/2} d( W ∘ F₀ ).

This is a centered Gaussian process with independent increments and variance at t given by ∫₀ᵗ f₀^{-1} dF₀ = t. Hence W̄ is a Wiener process, and we have, for every continuous g on [0, 1],

∫ g f₀^{1/2} dW̄ = ∫ g d( W ∘ F₀ ).

Utilizing W̄ in (24), we get a likelihood process for E_{2,n}(f₀),

exp{ n ∫ (f − f₀) f₀^{-1} · n^{-1/2} f₀^{1/2} dW̄ − (n/2) ∫ (f − f₀)² f₀^{-1} }

= exp{ n^{1/2} ∫ ( f/f₀ − 1 ) d( W ∘ F₀ ) − (n/2) ∫ ( f/f₀ − 1 )² dF₀ }

= Λ#_{2,n}(f, f₀).

Similarly, for E_{3,n}(f₀) we obtain a likelihood process

exp{ 4n ∫ ( f^{1/2} − f₀^{1/2} ) · ½ n^{-1/2} dW̄ − 2n ∫ ( f^{1/2} − f₀^{1/2} )² }

= exp{ 2n^{1/2} ∫ ( (f/f₀)^{1/2} − 1 ) d( W ∘ F₀ ) − 2n ∫ ( (f/f₀)^{1/2} − 1 )² dF₀ }

= Λ#_{3,n}(f, f₀).

Proof of Theorem 2.7. It now remains to apply (23) to the measures given by (62) when f ∈ Σ_n(f₀). We have to prove

(63) sup_{f∈Σ_n(f₀)} ‖λ_{1,f,f₀} − λ_{i,f,f₀}‖₂² = o(n^{-1})

for i = 2, 3, uniformly over f₀ ∈ Σ. Now (59) and (60) in the proof of Lemma 6.4 imply

‖λ_{f,f₀} + K(f₀, f) − λ̄_{f,f₀}‖₂² = O(γ_n⁴) = O( n^{-1} (log n)^{-4} ),

which proves (63) for i = 2. For i = 3, note first that for f ∈ Σ_n(f₀) we have (f/f₀)^{1/2} − 1 = O(γ_n) and use (58) with x = (f/f₀)^{1/2}(F₀^{-1}(t)) to obtain

(64) λ_{f,f₀}(t) = 2 log( (f/f₀)^{1/2}(F₀^{-1}(t)) ) = λ_{3,f,f₀}(t) + O(γ_n²)

uniformly. Now (64) and (60) imply (63) for i = 3.

8. The preliminary estimator. Consider first an estimator based on the whole sample. For attainable rates in the uniform norm we can employ the kernel-type estimators used in the more delicate exact constant theory. Let ψ_n = (log n/n)^{α/(2α+1)}. The following result is shown in Korostelev and Nussbaum [(1996), Section 4].

Lemma 8.1. In the experiment E_{0,n} there is an estimator f̄_n and a κ > 0 such that

sup_{f∈Σ} P_f^n ( ‖f̄_n − f‖_∞ ≥ κψ_n ) → 0.

Proof of Lemma 3.1. Consider the estimator of Lemma 8.1 applied to a sample fraction y_i, i = 1, …, N_n; call it f̄_{N_n}. Then, since α > 1/2 and N_n ≥ n/log n,

ψ_{N_n} = ( N_n^{-1} log N_n )^{α/(2α+1)} ≤ ( n^{-1} log n · log n )^{α/(2α+1)} = o(γ_n).

This immediately implies

(65) sup_{f∈Σ} P_f^n ( ‖f − f̄_{N_n}‖_∞ > cγ_n ) → 0 for all c > 0.

It is easy to verify that the quantity

(66) µ = sup_{f∈Σ} ‖f‖_∞

is finite; this is a consequence of Hölder continuity in conjunction with ∫f = 1. Note that the set Σ is compact in the uniform metric: indeed it is equicontinuous and uniformly bounded according to (66), so compactness is implied by the Arzelà-Ascoli theorem. Now cover Σ by a finite set of uniform γ_n-balls with centers in Σ, and define Σ_{0,n} to be the set of the centers. Define f̂_n as the element in Σ_{0,n} closest to f̄_{N_n} (or, in case of nonuniqueness, select an element measurably). Analogously, for f ∈ Σ select a closest element g_f ∈ Σ_{0,n}. Then we have

‖f̂_n − f‖_∞ ≤ ‖f̂_n − f̄_{N_n}‖_∞ + ‖f̄_{N_n} − f‖_∞ ≤ ‖g_f − f̄_{N_n}‖_∞ + ‖f̄_{N_n} − f‖_∞

≤ ‖g_f − f‖_∞ + 2‖f̄_{N_n} − f‖_∞ ≤ 2‖f̄_{N_n} − f‖_∞ + γ_n.

Hence f̂_n also satisfies (65), and it takes values in the finite set Σ_{0,n}. From this we obtain immediately

sup_{f∈Σ} P_f^n ( ‖f/f̂_n − 1‖_∞ > γ_n ) → 0,

in view of the uniform bound f(t) ≥ ε for f ∈ Σ.
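The discretization step of the proof, expressed as code (an illustrative sketch with a toy net; the candidate family and the raw estimate are my own stand-ins, not the paper's construction): the raw estimate is replaced by the nearest member of a finite γ_n-net of densities, so that f̂_n takes finitely many values.

```python
import numpy as np

def snap_to_net(f_raw, net):
    """Return the net element closest to f_raw in the sup norm."""
    dists = [np.max(np.abs(f_raw - g)) for g in net]
    return net[int(np.argmin(dists))]

grid = np.linspace(0.0, 1.0, 201)
net = [1.0 + c * np.cos(2 * np.pi * grid)            # a toy finite net Σ_{0,n}
       for c in np.linspace(-0.4, 0.4, 17)]
f_raw = 1.0 + 0.22 * np.cos(2 * np.pi * grid) + 0.01 * np.sin(6 * np.pi * grid)
f_hat = snap_to_net(f_raw, net)                      # takes finitely many values
```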