
Nonparametric Statistics
Vol. 17, No. 1, January–February 2005, 121–133

Asymptotic properties of density estimates for linear processes: application of projection method

ARTUR BRYK† and JAN MIELNICZUK‡*

†Warsaw University of Technology, Pl. Politechniki 1, Warsaw, Poland
‡Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warsaw, Poland

†Affiliated also with the Polish-Japanese Institute of Information Technology, Warsaw, Poland.
*Corresponding author. Email: miel@ipipan.waw.pl

(Received December 2003; revised in final form June 2004)

We specify conditions under which the kernel density estimate for a linear process is weakly and strongly consistent, and establish rates of its pointwise and uniform convergence. In particular, it is proved that for short-range dependent data of size $n$ and bandwidth $b_n$, the rate of convergence is $O((\log n/(n b_n))^{1/2} + b_n^2)$. The results are established using the projection method, introduced in this setup by Ho and Hsing (Ho, H. C. and Hsing, T. (1996). On the asymptotic expansion of the empirical process of long-memory moving averages. Annals of Statistics, 24, 992–1024) and Wu (Wu, W. B. (2001). Nonparametric estimation for stationary processes. PhD thesis, University of Michigan, available at http://www.stat.uchicago.edu/research/techreports.html).

Keywords: Long- and short-range dependence; Kernel density estimators; Linear process; Projection method; Rates of convergence

Nonparametric Statistics ISSN 1048-5252 print/ISSN 1029-0311 online © 2005 Taylor & Francis Ltd, http://www.tandf.co.uk/journals, DOI: 10.1080/10488550400067770

1. Introduction

Let $(X_t)$ be a stationary sequence with marginal density $f$, and let $\hat f_n(x)$ be a kernel density estimate of $f(x)$ based on $n$ observations $X_1, X_2, \ldots, X_n$, given by
$$\hat f_n(x) = \frac{1}{n b_n} \sum_{t=1}^n K\Big(\frac{x - X_t}{b_n}\Big),$$
where the kernel $K$ is some function, not necessarily positive, such that $\int K(s)\,ds = 1$, and the bandwidths (smoothing parameters) satisfy the natural conditions $b_n \to 0$ and $n b_n \to \infty$. The majority of the asymptotic properties of $\hat f_n(x)$ have been established for independent $(X_t)$; recently, however, there has been increasing interest in the dependent case. In one line of research, weak dependence assumptions (usually various types of mixing conditions) are imposed on $(X_t)$, and it is shown that the results for independent data carry over to this more general case; see ref. [1] for an overview of such results.

Alternatively, one can consider a flexible submodel which restricts the class of stationary processes considered, but not the strength of their dependence, and check when the analogy with the independent case breaks down. This is especially interesting for the estimation of local parameters such as a density function, because the answer depends not only on the strength of the dependence, but also on the effective number of observations used in the estimation, which is determined by the magnitude of the smoothing parameter. For representative examples of this line of research, we refer to refs [2–5]. Studied models include transformed stationary Gaussian sequences and infinite-order moving averages, among others. In this article, we consider the latter, assuming throughout that
$$X_t = \sum_{i=0}^{\infty} c_i \eta_{t-i}, \quad t = 1, 2, \ldots, \qquad (1)$$
where $(\eta_i)_{i=-\infty}^{\infty}$ are i.i.d. random variables with mean 0 and finite variance $\sigma^2$, and $(c_i)$ is such that $\sum_{i=0}^{\infty} c_i^2 < \infty$. We additionally assume that $c_0 = 1$ and that a density $f_1$ of $\eta_1$ exists. If $c_i = L(i) i^{-\beta}$, where $1/2 < \beta < 1$ and $L(\cdot)$ is slowly varying at $\infty$, a routine calculation based on the Karamata theorem implies that
$$r_X(i) := \mathrm{Cov}(X_0, X_i) \sim C(\beta) L^2(i)\, i^{-(2\beta - 1)} \sigma^2,$$
where $C(\beta) := \int_0^{\infty} (x + x^2)^{-\beta}\,dx$ and $a_n \sim b_n$ means $\lim_{n\to\infty} a_n/b_n = 1$. Thus, in this case, the sum of the absolute values of the covariances diverges. This property is called long-range dependence (LRD) or long memory, in contrast to the short-range dependence (SRD) case of absolutely summable covariances. Note that if $\sum_{i=0}^{\infty} |c_i| < \infty$, or $\beta > 1$ in the hyperbolic decay condition given earlier, then $(X_t)$ is SRD. Let $\sigma_n^2 = \mathrm{Var}(X_1 + \cdots + X_n)$ and put $c_i = 0$ for $i < 0$. Then it is easily seen that
$$\sigma_n^2 = \sigma^2 \sum_{k=-\infty}^{n} \Big( \sum_{t=1}^n c_{t-k} \Big)^2$$
is $O(n)$ when $\sum_{i=0}^{\infty} |c_i| < \infty$, whereas $\sigma_n^2 \sim D(\beta)\, n^{3-2\beta} L^2(n)$ in the LRD case, where $D(\beta) := \sigma^2 [(2 - 2\beta)(3/2 - \beta)]^{-1} C(\beta)$.

In this article, we study conditions under which the kernel density estimate is consistent in different senses, and investigate the rates of its a.s. convergence. In particular, we prove that, under natural conditions on the bandwidths, $\hat f_n(x)$ is always weakly (i.e., in probability) consistent when the density of $\eta_1$ is Lipschitz and $K \in L_2$. Moreover, for absolutely summable $(c_i)$, the rate of a.s. convergence of $\hat f_n(x) - f(x)$ to 0 is $O((\log n/(n b_n))^{1/2} + b_n^2)$. This parallels the result for independent data. A method which turns out to be particularly well suited to such problems is the projection method, two variants of which were introduced in the considered setup by Ho and Hsing [3] and Wu [6]. The method is briefly discussed in section 2. Both variants are presented here, as both turn out to be useful, depending on the particular problem considered. Applications of the projection method to many statistical problems, including the asymptotic behaviour of empirical processes and quantiles for linear processes and iterated dynamical systems, have been studied in depth by Wu in a series of papers [7–9]. A detailed analysis of the asymptotic distributions of $\hat f_n(x) - E\hat f_n(x)$ is given by Wu and Mielniczuk [5]. Their results imply (cf. the proof of their Theorem 2) that when the density $f_1$ of $\eta_1$ is three times differentiable with bounded, continuous and integrable derivatives, then $\hat f_n(x) - E\hat f_n(x) = O_P((n b_n)^{-1/2} + \sigma_n/n)$. Their method of proof also relied on the martingale representation given in equation (5). The present article shows that the projection method is useful for investigating rates of a.s. and uniform convergence as well.
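To make the setting concrete, here is a minimal simulation sketch (our illustration, not part of the original article): it generates a short-range dependent linear process of the form (1), with the infinite sum truncated at a finite lag, using standard normal innovations and a Gaussian kernel as arbitrary illustrative choices, and evaluates $\hat f_n(x)$ on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_process(n, c, rng):
    """Simulate X_t = sum_{i=0}^{m-1} c_i * eta_{t-i} (truncated version of (1))
    with i.i.d. N(0, 1) innovations."""
    m = len(c)
    eta = rng.standard_normal(n + m)
    # moving-average filter; output entry k equals sum_j c_j * eta_{k+m-1-j}
    return np.convolve(eta, c, mode="valid")[:n]

def kde(x, X, b):
    """Kernel density estimate (1/(n*b)) * sum_t K((x - X_t)/b), Gaussian K."""
    u = (x - X[:, None]) / b
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=0) / b

n = 5000
c = 0.5 ** np.arange(30)        # c_0 = 1, sum |c_i| < infinity: SRD case
X = linear_process(n, c, rng)
b_n = n ** (-1 / 5)             # b_n -> 0 and n*b_n/log(n) -> infinity hold
grid = np.linspace(-4.0, 4.0, 9)
print(np.round(kde(grid, X, b_n), 4))
```

The bandwidth exponent $-1/5$ is the classical choice balancing variance against squared bias; any sequence with $b_n \to 0$ and $n b_n/\log n \to \infty$ fits the theorems below.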
Results similar to Theorem 4(a) on rates of uniform convergence, based on a strong mixing condition, were established by Bosq [1] and Fan and Yao [10]. In particular, it follows from Theorem 5.3 of Fan and Yao [10] that if the strong mixing coefficients $\alpha(n)$ of the underlying stationary process decay geometrically, $b_n \sim c n^{-\gamma}$ for $0 < \gamma < 1$, $f$ is bounded on $[a, b]$ and $K$ satisfies a Lipschitz condition, then $\sup_{x \in [a,b]} |\hat f_n(x) - E\hat f_n(x)| = O((\log n/(n b_n))^{1/2})$ in probability. Thus, the rate of convergence of the random component of $\hat f_n(x)$ coincides with that of Theorem 4(a).

More generally, $\alpha(n) \le C n^{-\beta}$ with $\beta > 5/2$ implies the same rate of convergence, provided the sequence $(b_n)$ tends to 0 sufficiently slowly. One gets further insight into this problem by considering conditions under which linear processes are strongly mixing. There are many such results in the literature, e.g., refs [11–14]. In particular, for $c_n \sim c n^{-\delta}$, under certain regularity conditions on the density of $\eta_i$, it follows from Withers [13] that the strong mixing coefficients satisfy
$$\alpha(k) = O\Big( \sum_{n=k}^{\infty} \max\big( C_n^{1/3}, (C_n |\log C_n|)^{1/2} \big) \Big) = O(k^{(4 - 2\delta)/3}),$$
where $C_k = \sum_{i=k}^{\infty} c_i^2$. Thus, the conditions of Fan and Yao, when $c_n \sim c n^{-\delta}$, are satisfied for $\delta > 23/4$. At the same time, Theorem 4(a) does not require any condition on the decay rate of $(c_n)$; the sole condition is $(c_n) \in l^1$. Let us also note that, as mixing coefficients measure departure from independence, asymptotic results based on the mixing approach concern the weak dependence case, and handling the LRD case under such conditions seems infeasible. In this article, we restrict ourselves to linear processes and try to quantify the effect of dependence on the rates of convergence of $\hat f_n(\cdot)$.

2. Projection method

Let $(V_t)$ be a strictly stationary sequence of real random variables such that $E|V_1| < \infty$, and $V_t$ is $F_t$-measurable, where $(F_t)_{t=-\infty}^{\infty}$ is an increasing sequence of $\sigma$-fields such that $\bigcap_{i=-\infty}^{t} F_i$ is trivial for any $t$. In this section, we do not assume that $V_t$ is an infinite-order moving average. The main tool used in the article is the projection method, which exploits in different ways the following basic equality:
$$V_t - EV_t = \sum_{k=-\infty}^{t} \big( E(V_t \mid F_k) - E(V_t \mid F_{k-1}) \big), \qquad (2)$$
holding a.s. owing to the fact that $E(V_t \mid F_j) \to E(V_t \mid \bigcap_{i=-\infty}^{t} F_i) = E(V_t)$ a.s. as $j \to -\infty$. Let $P_k U = E(U \mid F_k) - E(U \mid F_{k-1})$ denote the projection differences. Note that the equality $E(U \mid F_k) = P_k U + E(U \mid F_{k-1})$ yields an orthogonal decomposition of $E(U \mid F_k)$ with respect to $F_{k-1}$. It follows from equation (2) that
$$\sum_{t=1}^{n} (V_t - EV_t) = \sum_{t=1}^{n} \sum_{k=-\infty}^{t} P_k V_t. \qquad (3)$$
There are two possible ways of regrouping the summands in equation (3). The first consists in writing it formally as
$$\sum_{k=-\infty}^{n} P_k \Big( \sum_{t=1}^{n} V_t \Big) =: \sum_{k=-\infty}^{n} U_{n,k}, \qquad (4)$$
using $P_k V_l = 0$ for $l < k$, whereas the second represents equation (3) as
$$\sum_{k=0}^{\infty} \sum_{t=1}^{n} P_{t-k} V_t =: \sum_{k=0}^{\infty} W_{n,k}. \qquad (5)$$
To observe the main difference between equations (4) and (5), note that the $U_{n,k}$ in equation (4) are sums of not necessarily uncorrelated summands, but $U_{n,k}$ and $U_{n,k'}$ are uncorrelated for $k \neq k'$. On the other hand, the summands $P_{t-k} V_t$, $t = 1, \ldots, n$, of $W_{n,k}$ are uncorrelated, being martingale differences with respect to $(F_{t-k})_{t=1}^{n}$, but $W_{n,k}$ may be correlated with $W_{n,k'}$ for $k \neq k'$.
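As a concrete illustration (our addition), for the linear process (1) itself the projection differences can be computed in closed form, since conditioning on $F_k = \sigma(\ldots, \eta_{k-1}, \eta_k)$ simply truncates the series:

```latex
% Projection differences of V_t = X_t = \sum_{i \ge 0} c_i \eta_{t-i}
\begin{aligned}
E(X_t \mid F_k) &= \sum_{i \ge t-k} c_i \eta_{t-i}
  \qquad \text{(innovations after time $k$ average out to $0$)},\\
P_k X_t &= E(X_t \mid F_k) - E(X_t \mid F_{k-1}) = c_{t-k}\,\eta_k
  \quad \text{for } k \le t, \qquad P_k X_t = 0 \text{ for } k > t.
\end{aligned}
```

Hence, for $V_t = X_t$, decomposition (3) is just the moving-average representation, with $U_{n,k} = (\sum_{t=\max(1,k)}^{n} c_{t-k})\,\eta_k$ in (4) and $W_{n,k} = c_k \sum_{t=1}^{n} \eta_{t-k}$ in (5); for a nonlinear functional such as $K_{b_n}(x - X_t)$, the projections instead involve conditional densities, as in the proof of Lemma 1 below.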

In what follows, $\|Y\| = (E(Y^2))^{1/2}$ denotes the $L_2$-norm of a random variable $Y$. Observe that equality (4) leads to a more precise bound for the $L_2$-norm of the centred sum of $V_t$ than equation (5). Namely, using the stationarity of $(V_t)$, we get [6]
$$\Big\| \sum_{t=1}^{n} (V_t - EV_t) \Big\|^2 = \sum_{k=-\infty}^{n} \|U_{n,k}\|^2 \le \sum_{k=-\infty}^{n} \Big( \sum_{t=\max(1,k)}^{n} \|P_1 V_{t-k+1}\| \Big)^2, \qquad (6)$$
whereas the analogous handling of equation (5) yields
$$\Big\| \sum_{t=1}^{n} (V_t - EV_t) \Big\| \le \sum_{k=0}^{\infty} \Big\| \sum_{t=1}^{n} P_{t-k} V_t \Big\| = n^{1/2} \sum_{k=1}^{\infty} \|P_1 V_k\|,$$
which is obviously weaker than equation (6). This is the reason why representation (4), and not equation (5), is used to bound the variance, i.e., the squared $L_2$-norm of the centred density estimate, in Lemma 1. In particular, if $\|P_1 V_t\| \le \theta_t$ for $t \ge 1$, the last expression in equation (6) is not greater than
$$\sum_{k=-\infty}^{0} \Big( \sum_{t=1}^{n} \theta_{t-k+1} \Big)^2 + \sum_{k=1}^{n} \Big( \sum_{t=k}^{n} \theta_{t-k+1} \Big)^2 \le \sum_{k=1}^{\infty} (\Lambda_{n+k} - \Lambda_k)^2 + n \Lambda_n^2 =: \Theta_n^2, \qquad (7)$$
where $\Lambda_n = \sum_{i=1}^{n} \theta_i$. In the case when $\sum_{i=1}^{\infty} \theta_i < \infty$, we have $\Theta_n^2 = O(n)$, as $\sum_{k=1}^{\infty} (\Lambda_{n+k} - \Lambda_k)^2 \le n (\sum_{i=1}^{\infty} \theta_i)^2$. However, observe that it is often beneficial to exploit the martingale structure of $W_{n,k}$ in equation (5), e.g., by taking advantage of known exponential inequalities for sums of martingale differences [15]. This is the approach used in Theorem 3 to study the rates of convergence of $\hat f_n(x)$. When using representation (5), the summand $W_{n,0}$ usually has to be handled differently from the terms $W_{n,k}$ for $k > 0$, as it involves the random variable $V_t$ itself, which is not conditionally averaged, in contrast to the remaining terms. This leads to the natural decomposition $W_{n,0} + \sum_{k>0} W_{n,k}$, in which the second term is the sum of the random variables $\tilde V_t = E(V_t \mid F_{t-1}) - EV_t$, $t = 1, \ldots, n$, for which decomposition (4) may be employed.

The idea of using martingale approximations to investigate the properties of $S_n = \sum_{t=1}^{n} V_t$ dates back to Gordin and Lifszic [16], who considered this problem for $V_t = g(Y_t)$, where $(Y_t)$ is an ergodic Markov chain. Note that, for $X_t$ defined in equation (1), $\hat f_n(x)$ is a special case of this setup, obtained by letting $Y_t = (\ldots, \eta_{t-1}, \eta_t)$ and defining $g = g_n$ by $g_n(Y_t) = n^{-1} K_{b_n}(x - X_t)$. They proved that if $E(S_n \mid Y_1) \to h(Y_1)$ in $L_2$, then $h$ is a solution of the Poisson equation $g(x) = h(x) - Qh(x)$, where $Qh(x) = \int h(y)\, Q(x; dy)$ and $Q(x, B)$ is the transition function of $(Y_t)$. Then $S_n = \sum_{k=1}^{n} M_k + R_n$, where $M_k = h(Y_k) - Qh(Y_{k-1})$ is a martingale difference with respect to $(F_k)$ and $R_n = Qh(Y_0) - Qh(Y_n)$ is a remainder term. However, for the problems considered in this article, weaker conditions are obtained when representation (5) is studied directly than those pertaining to the existence of a solution of the Poisson equation for $g_n$ defined earlier.
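Since the $U_{n,k}$ in (4) are orthogonal, for the linear process (1) the first equality in (6) reduces to the exact variance formula $\sigma_n^2 = \sigma^2 \sum_k (\sum_t c_{t-k})^2$ quoted in section 1. The following quick Monte Carlo check (our illustration; all parameter values are arbitrary and the MA tail is truncated) confirms the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 100, 200, 5000      # sample size, truncation lag, replications
c = 0.7 ** np.arange(m)          # SRD coefficients, c_0 = 1, sigma^2 = 1

# Orthogonal decomposition (4): U_{n,k} = (sum_{t=max(1,k)}^n c_{t-k}) * eta_k,
# so Var(S_n) = sum_k (sum_{t=max(1,k)}^n c_{t-k})^2 exactly.
weights = np.array([c[max(1, k) - k : n - k + 1].sum()
                    for k in range(2 - m, n + 1)])
var_exact = (weights ** 2).sum()

# Monte Carlo estimate of Var(S_n), where S_n = X_1 + ... + X_n
eta = rng.standard_normal((reps, n + m))
S = np.array([np.convolve(e, c, mode="valid")[:n].sum() for e in eta])
print(var_exact, S.var())        # the two values should agree closely
```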

3. Consistency and rates of almost sure convergence of kernel density estimates

In Theorem 1, we investigate conditions under which the weak and strong consistency of $\hat f_n$ hold. In both cases, no decay rate of the coefficients $c_i$ is needed. Theorem 2 provides an asymptotic representation of the centred $\hat f_n(x)$. In Theorem 3, the rate of a.s. convergence of $\hat f_n(x)$ is established for the SRD and LRD cases. The rate of uniform weak consistency over any finite interval is established in Theorem 4. The projection method is applied to the rows of the row-wise stationary array $Y_{tn} = n^{-1}(K_{b_n}(x - X_t) - E\hat f_n(x))$, $t = 1, 2, \ldots, n$, $n \in N$, where $F_t = \sigma(\ldots, \eta_{t-1}, \eta_t)$ is the $\sigma$-field generated by all innovations up to the moment $t$.

THEOREM 1
(a) Assume that the density $f_1$ of $\eta_1$ exists and is Lipschitz continuous with Lipschitz constant $A$, $n b_n \to \infty$ and $\int K^2(s)\,ds < \infty$. Then $\hat f_n(x) \to f(x)$ in probability.
(b) Assume, moreover, that $f_1$ is bounded, $n b_n/\log n \to \infty$ and $\int |v K(v)|\,dv < \infty$. Then $\hat f_n(x) \to f(x)$ a.s.

The following theorem provides a decomposition of $\hat f_n(x) - E\hat f_n(x)$ with an explicit bound on the remainder term for $c_i$'s decaying hyperbolically. A weaker result, under more stringent conditions on the density $f_1$ and with remainder term $O_P(n^{-1/2}) + o_P(\sigma_n/n)$, has been proved by Wu and Mielniczuk [5]. Let $\Theta_n$ be defined as in equation (7) with $\theta_t = |c_{t-1}| (\sum_{i=t-1}^{\infty} c_i^2)^{1/2}$, and let $f * g$ denote the convolution of $f(\cdot)$ and $g(\cdot)$. A sequence $(Z_n)_{n=1}^{\infty}$ of random variables converges to 0 completely if $\sum_{n=1}^{\infty} P(|Z_n| > \varepsilon) < \infty$ for any $\varepsilon > 0$.

THEOREM 2 Assume that $f_1$ is a bounded, twice continuously differentiable function with bounded derivatives, and $K$ satisfies the assumptions of Theorem 1(b). Then
$$\hat f_n(x) - E\hat f_n(x) = M_n(x) - (K_{b_n} * f)'(x)\, n^{-1} \sum_{t=1}^{n} X_{t,t-1} + O_P\Big(\frac{\Theta_n}{n}\Big), \qquad (8)$$
where $M_n(x) := \sum_{t=1}^{n} (Y_{tn} - E(Y_{tn} \mid F_{t-1}))$, $X_{t,s} := E(X_t \mid F_s)$ and $\Theta_n$ is defined in equation (7). $M_n(x)$ is a sum of martingale differences converging completely to 0 provided $n b_n/\log n \to \infty$. If $c_i = L(i) i^{-\beta}$ for some $\beta > 1/2$, then
$$\Theta_n = O\Big( n^{1/2} \sum_{i=1}^{n} i^{1/2 - 2\beta} L^2(i) + n^{2 - 2\beta} L^2(n) \Big),$$
which is $O(n^{2-2\beta} L^2(n))$ for $\beta < 3/4$ and $O(n^{1/2})$ for $\beta > 3/4$. If $\beta > 3/4$ and $E\eta_1^4 < \infty$, the second term in equation (8) converges completely to 0.

From Theorem 2, it follows that for twice continuously differentiable $f$ we have $\hat f_n(x) - f(x) = O_P((n b_n)^{-1/2} + \sigma_n/n + b_n^2)$. Next, we study a.s. rates and uniform rates in probability. Let $C^{(2)}(R)$ denote the family of twice continuously differentiable real functions.

THEOREM 3 Assume that the conditions of Theorem 1(b) are satisfied, $f_1 \in C^{(2)}(R)$, $E|\eta_1|^p < \infty$ for some $p > 2$, $K$ is symmetric and $\int v^2 |K(v)|\,dv < \infty$.
(a) If $\sum_{i=1}^{\infty} |c_i| < \infty$, then
$$|\hat f_n(x) - f(x)| = O\Big( \Big(\frac{\log n}{n b_n}\Big)^{1/2} + b_n^2 \Big) \quad \text{a.s.} \qquad (9)$$
(b) If $c_i = L(i) i^{-\beta}$ for $1/2 < \beta < 1$, then
$$|\hat f_n(x) - f(x)| = O\Big( \Big(\frac{\log n}{n b_n}\Big)^{1/2} + \frac{\sigma_n (\log n)^{1/2}}{n} + b_n^2 \Big) \quad \text{a.s.} \qquad (10)$$
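Two standard facts, spelled out here for orientation (our addition; they are implicit in the statements above): the $b_n^2$ term in (9) and (10) is the usual bias contribution of a symmetric kernel, and balancing it against the stochastic term yields the familiar bandwidth order. A sketch, under the assumptions of Theorem 3:

```latex
% Bias of the kernel estimate for symmetric K, \int K = 1, f \in C^{(2)}(R):
E\hat f_n(x) - f(x) = \int K(v)\,\bigl(f(x - b_n v) - f(x)\bigr)\,dv
  = \frac{b_n^2}{2}\, f''(x) \int v^2 K(v)\,dv + o(b_n^2),
\quad \text{since } \int v\,K(v)\,dv = 0.

% Balancing the two terms in (9):
\Bigl(\frac{\log n}{n b_n}\Bigr)^{1/2} \asymp b_n^2
  \iff b_n \asymp \Bigl(\frac{\log n}{n}\Bigr)^{1/5},
\quad \text{giving the a.s. rate } O\bigl((\log n/n)^{2/5}\bigr)
  \text{ in the SRD case (a).}
```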

THEOREM 4 Assume that the conditions of Theorem 1(b) are satisfied, $K$ is bounded, symmetric and Lipschitz continuous with Lipschitz constant $L$, $f_1$ is twice continuously differentiable with bounded derivatives, $f_1' \in L_1(R)$ and $f_1^{(2)} \in L_2(R)$.
(a) If $\sum_{i=1}^{\infty} |c_i| < \infty$, then
$$\sup_{x \in [a,b]} |\hat f_n(x) - f(x)| = O_P\Big( \Big(\frac{\log n}{n b_n}\Big)^{1/2} + b_n^2 \Big).$$
(b) If $c_i = L(i) i^{-\beta}$ for $1/2 < \beta < 1$, then
$$\sup_{x \in [a,b]} |\hat f_n(x) - f(x)| = O_P\Big( \Big(\frac{\log n}{n b_n}\Big)^{1/2} + \frac{\sigma_n}{n} + b_n^2 \Big),$$
for any finite interval $[a, b]$.

4. Proofs

The proofs of Theorems 1–4 proceed through the following lemmas. Let us stress that both variants of the projection method turn out to be useful here. We apply bound (6) to $\hat f_n(x) - E\hat f_n(x)$ to prove Theorem 1(a). The decomposition (5),
$$\hat f_n(x) - E\hat f_n(x) = W_{n,0} + \sum_{k>0} W_{n,k} =: M_n(x) + N_n(x),$$
is employed to prove Theorem 1(b). The martingale structure of $M_n(x)$ is used to establish its a.s. convergence to 0, whereas the term $N_n(x)$ is analysed using the ergodic theorem. Furthermore, we take advantage of bound (6) again to find a suitable approximation of $N_n(x)$, needed to prove Theorem 2. An exponential inequality for the sum of martingale differences $M_n(x)$ is used to prove Theorem 3, whereas the Burkholder inequality is employed to study the term $N_n(x)$.

LEMMA 1 Assume that the assumptions of Theorem 1(a) are satisfied. Then
$$\mathrm{Var}\,\hat f_n(x) \le 8 C_1^2 \|\eta_1\|^2 n^{-2} \sum_{k=-\infty}^{n} \Big( \sum_{t=\max(1,k)}^{n} |c_{t-k}| \Big)^2 + 2 n^{-1}\, \mathrm{Var}\, K_{b_n}(x - X_1), \qquad (11)$$
where $C_1 = A \int |K(s)|\,ds$.

Observe that the bound in equation (11) can be written as $C\, \mathrm{Var}(\tilde X_1 + \cdots + \tilde X_n)/n^2 + D\, \mathrm{Var}\,\hat f_n^0(x)$, where $\tilde X_t = \sum_{i=0}^{\infty} |c_i| \eta_{t-i}$ and $\hat f_n^0(x)$ is a kernel density estimate based on an i.i.d. sample of size $n$ having density $f$. This follows upon noting that $\tilde X_1 + \cdots + \tilde X_n = \sum_{k=-\infty}^{n} (\sum_{t=\max(1,k)}^{n} |c_{t-k}|)\, \eta_k$. Only the first term in the bound is influenced by the dependence of the underlying sequence $(X_t)$.

LEMMA 2 Let $f_1$ be a bounded density, $\int K^2(s)\,ds < \infty$ and $n b_n/\log n \to \infty$. Then $M_n(x) = \sum_{t=1}^{n} (Y_{tn} - E(Y_{tn} \mid F_{t-1})) \to 0$ completely.
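The decomposition underlying the proofs can be inspected numerically. For Gaussian innovations and a Gaussian kernel (our illustrative choices, not assumptions of the article), the conditional expectation $E(K_{b_n}(x - X_t) \mid F_{t-1})$ has a closed form, since $X_t = X_{t,t-1} + \eta_t$ with $\eta_t \sim N(0,1)$ independent of $F_{t-1}$, so the martingale part $M_n(x)$ can be computed exactly along a simulated path:

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(u, s):
    """Density of N(0, s^2) evaluated at u."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

m = 50
c = 0.6 ** np.arange(m)     # truncated SRD coefficients, c_0 = 1
x = 0.5                     # evaluation point

for n in (500, 5000, 50000):
    b = n ** (-1 / 5)
    eta = rng.standard_normal(n + m)
    X = np.convolve(eta, c, mode="valid")[:n]   # X_t, t = 1..n
    eta_t = eta[m - 1 : m - 1 + n]              # the innovation entering X_t
    X_pred = X - eta_t                          # X_{t,t-1} = E(X_t | F_{t-1})
    # Gaussian K_b convolved with the N(0,1) law of eta_t gives N(0, 1 + b^2):
    # E(K_b(x - X_t) | F_{t-1}) = phi(x - X_{t,t-1}, sqrt(1 + b^2)).
    M_n = np.mean(phi(x - X, b) - phi(x - X_pred, np.sqrt(1 + b ** 2)))
    print(n, abs(M_n))      # decays, roughly at the (n * b)^(-1/2) scale
```

The remaining part $N_n(x) = \hat f_n(x) - E\hat f_n(x) - M_n(x)$ is then a smooth functional of the predictable sequence $(X_{t,t-1})$, which is what the ergodic-theorem and Burkholder arguments below exploit.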

LEMMA 3 Assume that the conditions of Theorem 2 are satisfied. Then
$$\Big\| N_n(x) + (K_{b_n} * f)'(x)\, n^{-1} \sum_{t=1}^{n} X_{t,t-1} \Big\| = O_P\Big(\frac{\Theta_n}{n}\Big),$$
where $N_n(x) = \hat f_n(x) - E\hat f_n(x) - M_n(x)$ and $X_{t,s} = E(X_t \mid F_s)$. Let $S_n = \sum_{t=1}^{n} X_t$.

LEMMA 4 Assume that $E(|\eta_1|^{2p}) < \infty$ for some $p \ge 1$. Then $E|S_n|^{2p} = O((\mathrm{Var}\, S_n)^p)$.

Proof of Theorem 1(a). Writing $X_t = X_{t,t-1} + \eta_t$, it is easy to see that the continuity of $f_1$ implies the continuity of $f$, and thus, in view of the Bochner lemma, $b\, \mathrm{Var}\, K_b(x - X_1) \to f(x) \int K^2(s)\,ds$ as $b \to 0$, whence the second term on the right-hand side of equation (11) tends to 0. Moreover,
$$n^{-2} \sum_{k=-\infty}^{n} \Big( \sum_{t=\max(1,k)}^{n} |c_{t-k}| \Big)^2 = n^{-2} \sum_{t,t'=1}^{n} \sum_{j=0}^{\infty} |c_j|\, |c_{j + |t - t'|}| \le \frac{2}{n} \sum_{i=0}^{n-1} \sum_{j=0}^{\infty} |c_j c_{j+i}|,$$
and, by the Cauchy inequality, $\sum_{j=0}^{\infty} |c_j c_{j+i}| \le (\sum_{j=0}^{\infty} c_j^2)^{1/2} (\sum_{j=0}^{\infty} c_{j+i}^2)^{1/2} \to 0$ as $i \to \infty$, because $\sum_{j=0}^{\infty} c_j^2 < \infty$; hence the Cesàro mean above tends to 0. Thus, in view of equation (11), $\mathrm{Var}\,\hat f_n(x) \to 0$, and the weak consistency of $\hat f_n(x)$ follows, as $E\hat f_n(x) \to f(x)$ in view of the continuity of $f$ at $x$.

Proof of Theorem 1(b). Observe that, in view of Lemma 2, it is enough to prove that $N_n(x) \to 0$ a.s. Note that
$$N_n(x) = \int K(v)\, n^{-1} \sum_{t=1}^{n} f_1(x - X_{t,t-1} - b_n v)\,dv - E\hat f_n(x).$$
Observe that
$$|N_n(x)| \le \int |K(v)|\, n^{-1} \sum_{t=1}^{n} |f_1(x - X_{t,t-1} - b_n v) - f_1(x - X_{t,t-1})|\,dv + \Big| n^{-1} \sum_{t=1}^{n} f_1(x - X_{t,t-1}) - f(x) \Big| + |E\hat f_n(x) - f(x)|. \qquad (12)$$
The last term on the right-hand side tends to 0 in view of the continuity of $f$. As $(f_1(x - X_{t,t-1}))$ is ergodic, being an instantaneous transformation of the ergodic sequence $(X_{t,t-1})$, the second term also tends to 0 a.s. in view of the ergodic theorem, since $E f_1(x - X_{t,t-1}) = f(x)$. The first term is bounded by $A b_n \int |K(v) v|\,dv \to 0$, using the fact that $f_1$ is Lipschitz.

Proof of Theorem 2. Observe that, in view of Lemmas 2 and 3, we have
$$\hat f_n(x) - E\hat f_n(x) - M_n(x) + (K_{b_n} * f)'(x)\, n^{-1} \sum_{t=1}^{n} X_{t,t-1} = O_P\Big(\frac{\Theta_n}{n}\Big)$$

and $M_n(x) \to 0$ completely. In order to establish the bound on $\Theta_n$, observe that $\Theta_n^2 = O(n \Lambda_n^2 + \sum_{i=n+1}^{\infty} (\Lambda_{n+i} - \Lambda_i)^2)$ and, in view of the Karamata theorem, $\theta_i = O(i^{1/2 - 2\beta} L^2(i))$. Thus we have
$$\sum_{i=n+1}^{\infty} (\Lambda_{n+i} - \Lambda_i)^2 = \sum_{i=n+1}^{\infty} O((n \theta_i)^2) = n^2 \sum_{i=n+1}^{\infty} O(i^{1 - 4\beta} L^4(i)) = O(n^{4 - 4\beta} L^4(n)).$$
We now prove that, for $\beta > 3/4$ and $E\eta_1^4 < \infty$, $n^{-1} \sum_{t=1}^{n} X_{t,t-1}$ tends to 0 completely. Applying Lemma 4 with $X_{t,t-1}$ in place of $X_t$, $S_n = \sum_{t=1}^{n} X_{t,t-1}$ and $p = 2$, we get
$$P\Big( \frac{|S_n|}{n} \ge \varepsilon \Big) \le \frac{E S_n^4}{(n \varepsilon)^4} = O\Big( \frac{(\mathrm{Var}\, S_n)^2}{n^4} \Big).$$
Observe that, reasoning analogously to the proof of Theorem 1(a), we get $\mathrm{Var}\, S_n = O(n)$ when the $c_i$'s are absolutely summable and, as discussed in section 1, $\mathrm{Var}\, S_n \asymp n^{3 - 2\beta} L^2(n)$ for $1/2 < \beta < 1$, whence, for $\beta > 3/4$, $(\mathrm{Var}\, S_n)^2/n^4$ is summable. Thus, in these cases, $P(|S_n|/n \ge \varepsilon)$ is summable and the conclusion follows from the Borel–Cantelli lemma. The case $\beta \ge 1$ is treated analogously.

Proof of Lemma 1. In order to prove inequality (11), we use bound (6). Let $U_t = K_{b_n}(x - X_t) - E K_{b_n}(x - X_t)$ and note that $P_k U_t = 0$ for $t < k$. Thus
$$n^2\, \mathrm{Var}\,\hat f_n(x) \le \sum_{k=-\infty}^{n} \Big( \sum_{t=\max(1,k)}^{n} \big( \|P_1 U_{t-k+1}\|\, I\{t > k\} + \|P_1 U_1\|\, I\{t = k\} \big) \Big)^2 \le 2 \sum_{k=-\infty}^{n} \Big( \sum_{t > k} \|P_1 U_{t-k+1}\| \Big)^2 + 2 n\, \|P_1 U_1\|^2.$$
As $\|P_1 U_1\|^2 = E U_1^2 - E(E(U_1 \mid F_0))^2 \le E U_1^2$, it is enough to prove, for $t > 1$,
$$\|P_1 U_t\| \le C_1 \big( |c_{t-1}| (E \eta_1^2)^{1/2} + |c_{t-1}|\, E|\eta_1| \big) \le 2 C_1 |c_{t-1}|\, \|\eta_1\|.$$
To this end, observe that, writing
$$X_t = \sum_{i=t-s}^{\infty} c_i \eta_{t-i} + \sum_{i=0}^{t-s-1} c_i \eta_{t-i} =: X_{t,s} + R_{t,s},$$
we have, for $t > 1$,
$$P_1 U_t = \int K_{b_n}(x - z - X_{t,1}) f_{t-1}(z)\,dz - \int K_{b_n}(x - z - X_{t,0}) f_t(z)\,dz, \qquad (13)$$
where $f_{t-s}(\cdot)$ denotes the density of $R_{t,s}$. Moreover, an easy induction argument implies that $f_t$ is Lipschitz continuous with Lipschitz constant $A$ whenever $f_1$ has this property. Note that

equation (13) can be written as
$$\int K(z) \{ f_{t-1}(x - X_{t,1} - b_n z) - f_t(x - X_{t,1} - b_n z) \}\,dz + \int K(z) \{ f_t(x - X_{t,1} - b_n z) - f_t(x - X_{t,0} - b_n z) \}\,dz =: I + II.$$
Using the Lipschitz continuity of $f_t$, we have $\|II\| \le C_1 |c_{t-1}|\, \|\eta_1\|$. Moreover, as $f_t(s)$ can be written as the convolution $f_{t-1} * \tilde f(s)$, where $\tilde f$ is the density of $c_{t-1} \eta_1$, we have, for $s = x - X_{t,1} - b_n z$,
$$|I| \le \int |K(z)|\, \Big| \int \big( f_{t-1}(s) - f_{t-1}(s - y) \big) \tilde f(y)\,dy \Big|\,dz \le A \int |K(z)|\,dz \int |y| \tilde f(y)\,dy = C_1 E|c_{t-1} \eta_1|.$$

Proof of Lemma 2. We use a special case of the Freedman exponential inequality [15], stating that for a sum $S_n = n M_n(x)$ of bounded martingale differences $T_i$, $|T_i| \le B$, we have, for $\lambda > 0$,
$$E \exp(\lambda S_n) \le \exp(\beta B^{-2} e(B\lambda)),$$
where $e(\lambda) = e^{\lambda} - \lambda - 1$ and $\beta$ is a bound on the sum of the conditional variances of the $T_i$, such that $P(\sum_{i=1}^{n} E(T_i^2 \mid F_{i-1}) \le \beta) = 1$. In our case, $T_i = K_{b_n}(x - X_i) - E(K_{b_n}(x - X_i) \mid F_{i-1})$ and
$$E(T_i^2 \mid F_{i-1}) = E(K_{b_n}^2(x - X_i) \mid F_{i-1}) - \big( E(K_{b_n}(x - X_i) \mid F_{i-1}) \big)^2 \le \frac{1}{b_n^2} \int K^2\Big( \frac{x - X_{i,i-1} - s}{b_n} \Big) f_1(s)\,ds \le \frac{1}{b_n} \int K^2(s)\,ds\, \sup f_1,$$
and $B = 2 \sup |K| / b_n$. Thus $\sum_{i=1}^{n} E(T_i^2 \mid F_{i-1}) \le n C / b_n$ for some positive $C$, and, for $\lambda > 0$,
$$P(M_n(x) \ge \varepsilon) \le E \exp(\lambda S_n) \exp(-n \lambda \varepsilon) \le \exp\Big( \frac{C n}{b_n} B^{-2} e(B\lambda) - n \lambda \varepsilon \Big). \qquad (14)$$
Let $\lambda_n = \varepsilon_n b_n / C$, where $\varepsilon_n = (8 C \log n / (n b_n))^{1/2}$. Note that, as $e(\lambda) \sim \lambda^2/2$ for $\lambda \to 0$, we have $e(B \lambda_n) < (3/4)(B \lambda_n)^2$ for sufficiently large $n$. Thus, for such $n$, the last bound is smaller than
$$\exp(6 \log n - 8 \log n) = \exp(-2 \log n). \qquad (15)$$
Thus the Borel–Cantelli lemma implies that $M_n(x) \ge (8 C \log n / (n b_n))^{1/2}$ happens only finitely often, almost surely. Analogously, we show that $\sum_n P(-M_n(x) \ge \varepsilon_n) < \infty$, whence $M_n(x) \to 0$ completely.

Proof of Lemma 3. The proof uses the main ideas of the proof of Lemma 4 of Mielniczuk and Wu [17]. Observe that $n N_n(x) = \sum_{t=1}^{n} K_{b_n} * T_t(x)$, where $T_t(x) = f_1(x - X_{t,t-1}) - f(x)$,

where $*$ denotes convolution. We will prove that
$$\| P_1 K_{b_n} * T_t(x) + (K_{b_n} * f)'(x)\, c_{t-1} \eta_1 \| = O(|c_{t-1}|\, C_{t-1}^{1/2}), \qquad (16)$$
where $C_t = \sum_{i=t}^{\infty} c_i^2$. This will prove the lemma in view of equation (7). Writing
$$\Big\| \int K_b(x - s) \xi_s\,ds \Big\|^2 = E\Big( \int\!\!\int K_b(x - s) K_b(x - s') \xi_s \xi_{s'}\,ds\,ds' \Big)$$
for any family of random variables $(\xi_s)_{s \in R}$, and using the Cauchy inequality, we have
$$\Big\| \int K_b(x - s) \xi_s\,ds \Big\| \le \int |K(u)|\,du\, \sup_s \|\xi_s\|.$$
Thus, in order to prove equation (16), in view of $P_1 T_t(x) = f_{t-1}(x - X_{t,1}) - f_t(x - X_{t,0})$, it is enough to show that
$$\sup_s \| f_{t-1}(s - X_{t,1}) - f_t(s - X_{t,0}) + f'(s)\, c_{t-1} \eta_1 \| = O(|c_{t-1}|\, C_{t-1}^{1/2}). \qquad (17)$$
In view of Lemma 1 of Mielniczuk and Wu [17], $f'(x) = E f_{t-1}'(x - X_{t,1})$. Thus
$$\sup_s |f'(s) - f_{t-1}'(s)| \le \sup_s E|f_{t-1}'(s - X_{t,1}) - f_{t-1}'(s)| = O(C_{t-1}^{1/2}),$$
as $\sup_{s,t} |f_t^{(2)}(s)| < \infty$. Moreover,
$$\sup_s \| f_{t-1}'(s) - f_{t-1}'(s - X_{t,0}) \| = O(C_t^{1/2}).$$
Thus equation (17) follows from
$$\sup_s \| f_{t-1}(s - X_{t,1}) - f_t(s - X_{t,0}) + f_{t-1}'(s - X_{t,0})\, c_{t-1} \eta_1 \| = O(c_{t-1}^2). \qquad (18)$$
Let $\eta_1'$ be a copy of $\eta_1$ independent of $(\eta_i)_{i=-\infty}^{\infty}$, and let $X^*_{t,1} = X_{t,1} - c_{t-1} \eta_1 + c_{t-1} \eta_1'$. The random variable on the left-hand side of equation (18) can be written as
$$E\big( f_{t-1}(s - X_{t,1}) - f_{t-1}(s - X^*_{t,1}) + f_{t-1}'(s - X_{t,0})\, c_{t-1} \eta_1 \mid F_1 \big). \qquad (19)$$
Adding to equation (19)
$$E\big( f_{t-1}(s - X^*_{t,1}) - f_t(s - X_{t,0}) - f_{t-1}'(s - X_{t,0})\, c_{t-1} \eta_1' \mid F_1 \big) = 0,$$
we see that equation (18) follows from two-term Taylor expansions of $f_{t-1}(s - X_{t,0}) - f_{t-1}(s - X_{t,1})$ and $f_{t-1}(s - X_{t,0}) - f_{t-1}(s - X^*_{t,1})$.

Proof of Lemma 4. $S_n = \sum_{t=1}^{n} \sum_{i=-\infty}^{t} c_{t-i} \eta_i =: \sum_{i=-\infty}^{n} d_{n,i} \eta_i$ is a sum of martingale differences, as the $\eta_i$ are independent. Thus the Burkholder inequality implies
$$E|S_n|^{2p} \le C_p E\Big( \sum_{i=-\infty}^{n} (d_{n,i} \eta_i)^2 \Big)^p,$$
and, in view of the Minkowski inequality, we have
$$\|S_n\|_{2p}^2 \le C_p^{1/p} \Big\| \sum_{i=-\infty}^{n} (d_{n,i} \eta_i)^2 \Big\|_p \le C_p^{1/p} \sum_{i=-\infty}^{n} \| (d_{n,i} \eta_i)^2 \|_p = C_p^{1/p} \sum_{i=-\infty}^{n} d_{n,i}^2\, \|\eta_1\|_{2p}^2 = C_p^{1/p}\, \sigma^{-2} (\mathrm{Var}\, S_n)\, \|\eta_1\|_{2p}^2,$$
from which the statement of the lemma follows.

Proof of Theorem 3. (a) The proof of Lemma 2 indicates that $M_n(x) = O((\log n / (n b_n))^{1/2})$ a.s. As $E\hat f_n(x) - f(x) = O(b_n^2)$, it is enough to show that $N_n(x) = O((\log n / n)^{1/2} + b_n^2)$ a.s., provided the $c_i$'s are absolutely summable. We bound $N_n(x)$ as in the proof of Theorem 1(b) and note that the first and third terms of the bound are $O(b_n^2)$. Let
$$H_n(x) = \sum_{t=1}^{n} \big( f_1(x - X_{t,t-1}) - f(x) \big) =: \sum_{t=1}^{n} T_t(x).$$
We first show that
$$\| P_1 T_t(x) \|_p = O(|c_{t-1}|). \qquad (20)$$
To this end, consider a coupled version of $X_{t,t-1}$: $X^*_{t,t-1} = X_{t,t-1} - c_{t-1} \eta_1 + c_{t-1} \eta_1'$, where $\eta_1'$ has the same distribution as $\eta_1$ and is independent of $(\eta_i)$. Then $E(f_1(x - X_{t,t-1}) \mid F_0) = E(f_1(x - X^*_{t,t-1}) \mid F_1)$ and, as $f_1$ satisfies a Lipschitz condition, we have
$$\| P_1 T_t(x) \|_p = \big\| E\big( f_1(x - X_{t,t-1}) - f_1(x - X^*_{t,t-1}) \mid F_1 \big) \big\|_p \le \big\| E\big( A |c_{t-1}(\eta_1 - \eta_1')| \mid F_1 \big) \big\|_p \le A\, \| c_{t-1}(\eta_1 - \eta_1') \|_p = O(|c_{t-1}|),$$
in view of $E|\eta_1|^p < \infty$. Consider the decomposition $H_n(x) = \sum_{j=-\infty}^{n} P_j H_n(x)$ into martingale differences $P_j H_n(x)$. It follows from the Burkholder inequality that
$$E|H_n(x)|^p \le C_p E\Big( \sum_{j=-\infty}^{n} (P_j H_n(x))^2 \Big)^{p/2}$$
and, moreover,
$$\| P_j H_n(x) \|_p \le \sum_{t=j}^{n} \| P_1 T_{t-j+1}(x) \|_p \le C \sum_{t=j}^{n} |c_{t-j}|.$$

Thus
$$(E|H_n(x)|^p)^{2/p} \le C \Big\| \sum_{j=-\infty}^{n} (P_j H_n(x))^2 \Big\|_{p/2} \le C \sum_{j=-\infty}^{n} \| P_j H_n(x) \|_p^2 \le C' \sum_{j=-\infty}^{n} \Big( \sum_{t=j}^{n} |c_{t-j}| \Big)^2. \qquad (21)$$
As the $c_i$'s are absolutely summable, we get $E|H_n(x)|^p \le C n^{p/2}$. We apply the Móricz [18] inequality, which states that if, for some non-negative $a_i$ and $r > 0$, $q > 1$, $E|X_1 + \cdots + X_i|^r \le (\sum_{j=1}^{i} a_j)^q$, then $E(\max_{i \le n} |X_1 + \cdots + X_i|^r) \le C_{rq} (\sum_{j=1}^{n} a_j)^q$; we use it with $r = p$ and $q = p/2$. This, together with equation (21), implies that $E(\max_{i \le n} |H_i(x)|^p) \le C n^{p/2}$. Thus $H_n(x) = O(n^{1/2} (\log n)^{1/2})$ a.s. This follows, by the Borel–Cantelli lemma, from the observation that
$$P\big( \max_{i \le 2^k} |H_i(x)| \ge \varepsilon (2^k k)^{1/2} \big) \le \frac{E \max_{i \le 2^k} |H_i(x)|^p}{\varepsilon^p 2^{kp/2} k^{p/2}} = O(k^{-p/2}),$$
which is summable as $p > 2$.

(b) The proof of (b) follows the same argument, after noting that the analogue of equation (21) is $(E|H_n(x)|^p)^{2/p} = O(\sigma_n^2)$. Moreover, a two-fold application of the Karamata theorem yields that the Móricz inequality may be applied with $a_i = i^{2 - 2\beta} L^2(i)$. It yields $E(\max_{i \le n} |H_i(x)|^p) \le C \sigma_n^p$.

Proof of Theorem 4. We first prove that
$$\sup_{x \in [a,b]} |M_n(x)| = O\Big( \Big( \frac{\log n}{n b_n} \Big)^{1/2} \Big)$$
almost surely. Let $\varepsilon_n = 3(16 C \log n / (n b_n))^{1/2}$, where $C$ is defined in the proof of Lemma 2, and let $a = x_0 < x_1 < \cdots < x_{l_n} = b$ be an equipartition of $[a, b]$ with $l_n = [6 L (b - a) / (b_n^2 \varepsilon_n)] + 1$. Define $x_{k(x)}$ to be the point of the partition closest to $x$. Observe that
$$P\Big( \sup_{x \in [a,b]} |M_n(x)| \ge \varepsilon_n \Big) \le P\Big( \sup_{x \in [a,b]} |M_n(x) - M_n(x_{k(x)})| \ge \frac{\varepsilon_n}{3} \Big) + P\Big( \max_{0 \le k \le l_n} |M_n(x_k)| \ge \frac{\varepsilon_n}{3} \Big).$$
It is easy to see that, in view of the Lipschitz continuity of $K$, we have $|M_n(x) - M_n(x_{k(x)})| \le 2 L (b - a) / (b_n^2 l_n) \le \varepsilon_n / 3$, whence the first probability on the right-hand side is 0. The second term is trivially bounded by
$$(l_n + 1) \max_{0 \le k \le l_n} P\big( |M_n(x_k)| \ge 3^{-1} \varepsilon_n \big). \qquad (22)$$
Choosing $\lambda_n = \varepsilon_n b_n / (3 C)$ and reasoning as in the proof of Lemma 2, we get that expression (22) is bounded by $(l_n + 1) \exp(-4 \log n) \le C' (n / (b_n^3 \log n))^{1/2} \exp(-4 \log n)$, which is summable, and

the conclusion follows from the Borel–Cantelli lemma. Next, we show that, in case (b), $\sup_{x \in [a,b]} |N_n(x)| = O_P(\sigma_n / n + b_n^2)$; the reasoning in case (a) is similar. Bounding $N_n(x)$ as in the proof of Theorem 1(b), we see that the first and third terms of the bound are $O(b_n^2)$ uniformly in $x \in [a, b]$. In order to bound the second term, let, as in the proof of Theorem 3, $H_n(x) = \sum_{t=1}^{n} (f_1(x - X_{t,t-1}) - f(x))$ and $h_n(x) = H_n'(x)$. From the proof of Lemma 5 in Wu [9], it follows that, provided $f_1^{(2)} \in L_2(R)$ and $f_1'$ is bounded, we have
$$\int_R E h_n^2(x)\,dx = O\Big( \sum_{j=-\infty}^{n} \Big( \sum_{t=\max(1,j)}^{n} |c_{t-j}| \Big)^2 \Big) = O(\sigma_n^2).$$
Moreover, observe that
$$E\Big( \sup_{x \in [a,b]} |H_n(x) - H_n(a)| \Big)^2 \le E\Big( \int_a^b |h_n(x)|\,dx \Big)^2 \le (b - a) \int_a^b E h_n^2(x)\,dx.$$
The last two equations imply, via the Markov inequality, that $\sup_{x \in [a,b]} |H_n(x) - H_n(a)| = O_P(\sigma_n)$. The proof of Theorem 3 implies that $\|H_n(a)\|_p = O(\sigma_n)$, and the conclusion follows via the triangle inequality.

Acknowledgements

We thank Wei Biao Wu for his insightful comments on handling the term $H_n(x)$ in the proof of Theorem 4, and a referee for helpful remarks.

References

[1] Bosq, D., 1998, Nonparametric Statistics for Stochastic Processes: Estimation and Prediction. Vol. 110, Lecture Notes in Statistics (New York: Springer).
[2] Hall, P. and Hart, J.D., 1990, Convergence rates in density estimation for data from infinite-order moving average processes. Probability Theory and Related Fields, 87, 253–274.
[3] Ho, H.C. and Hsing, T., 1996, On the asymptotic expansion of the empirical process of long-memory moving averages. Annals of Statistics, 24, 992–1024.
[4] Csörgő, S. and Mielniczuk, J., 1999, Random design regression under long-range dependent errors. Bernoulli, 5, 209–224.
[5] Wu, W.B. and Mielniczuk, J., 2002, Kernel density estimation for linear processes. Annals of Statistics, 30, 1441–1459.
[6] Wu, W.B., 2001, Nonparametric estimation for stationary processes. PhD thesis, University of Michigan, available at http://www.stat.uchicago.edu/research/techreports.html.
[7] Wu, W.B., 2002, Central limit theorems for functionals of linear processes. Statistica Sinica, 12, 635–649.
[8] Wu, W.B., 2003a, Empirical processes of long-memory sequences. Bernoulli, 9, 809–831.
[9] Wu, W.B., 2003b, On the Bahadur representation of sample quantiles for dependent sequences, preprint.
[10] Fan, J. and Yao, Q., 2003, Nonlinear Time Series: Nonparametric and Parametric Methods (New York: Springer).
[11] Chanda, K., 1974, Strong mixing properties of linear stochastic processes. Journal of Applied Probability, 11, 401–408.
[12] Gorodetskii, V.V., 1977, On the strong mixing property for linear sequences. Theory of Probability and its Applications, 22, 411–413.
[13] Withers, C.S., 1981, Conditions for linear processes to be strongly mixing. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57, 477–480.
[14] Pham, T.D. and Tran, L.T., 1985, Some mixing properties of time series models. Stochastic Processes and their Applications, 19, 297–303.
[15] Freedman, D., 1975, On tail probabilities for martingales. Annals of Probability, 3, 100–118.
[16] Gordin, M.I. and Lifszic, B., 1978, The central limit theorem for stationary Markov processes. Doklady Akademii Nauk SSSR, 19, 392–394.
[17] Mielniczuk, J. and Wu, W.B., 2004, On random-design model with dependent errors. Statistica Sinica, to appear.
[18] Móricz, F., 1976, Moment inequalities and the strong laws of large numbers. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 35, 299–314.