Nonparametric regression with martingale increment errors
1 S. Gaïffas (LSTA, Paris 6), joint work with S. Delattre (LPMA, Paris 7). Work in progress.
2 Motivations

Some facts:
- The theoretical study of statistical algorithms usually requires stationarity and ergodicity, and concentration inequalities for finite-sample results. These are standard tools when the data is assumed i.i.d.: Bernstein's or Talagrand's inequality are quite popular in statistics.
- Beyond independence, it is also standard to use a mixing assumption, such as β-mixing: it allows one to recover independence by coupling (the well-known Berbee's lemma), so that, roughly, the independence tools can be used again (on blocks). This is the approach adopted in many papers.

Problem: stationarity and mixing are hard to verify on data. Moreover, under a mixing assumption, the statistical procedure often depends on the mixing coefficients, but these cannot be estimated!

The aim of this work is to study well-known statistical procedures (kernel estimation, Lepski's method) without stationarity and ergodicity assumptions.

Idea: we replace such assumptions by an assumption on the structure of the model: we consider a regression model where the noise is a martingale increment.
3 A model: regression with martingale increment errors

$(X_k)_{k \ge 0}$ and $(Y_k)_{k \ge 1}$ are $(\mathcal F_k)_{k \ge 0}$-adapted sequences of real random variables such that
$$Y_k = f(X_{k-1}) + \varepsilon_k, \qquad (1)$$
where $(\varepsilon_k)_{k \ge 0}$ is an $(\mathcal F_k)$-martingale increment, i.e. $\mathbb E(|\varepsilon_k| \mid \mathcal F_{k-1}) < \infty$ and $\mathbb E(\varepsilon_k \mid \mathcal F_{k-1}) = 0$, and where $f : \mathbb R \to \mathbb R$ is the unknown function of interest.

Assumption: there is an $(\mathcal F_k)$-adapted sequence $(\sigma_k)_{k \ge 0}$ of positive random variables such that, for all $k \ge 0$,
$$\mathbb E\big[\exp\big(\mu\, \varepsilon_k^2 / \sigma_{k-1}^2\big) \mid \mathcal F_{k-1}\big] \le \gamma, \qquad (2)$$
where $\mu, \gamma > 0$. We observe $(\sigma_k)_{k \ge 0}$.

Goal: estimate $f$ at a point $x \in \mathbb R$ based on the observations $(Y_1, \ldots, Y_N)$ and $(X_0, \ldots, X_{N-1})$, where $N \ge 1$ is a stopping time.
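The model is easy to simulate. The following pure-Python sketch uses an autoregressive design with an ARCH-flavoured adapted volatility; the choice of $f$, the volatility dynamics and all constants are illustrative, not taken from the talk:

```python
import math
import random

def f(x):
    """Illustrative unknown regression function (an arbitrary choice)."""
    return math.sin(x)

def simulate_model(n, seed=0):
    """Simulate Y_k = f(X_{k-1}) + eps_k with eps_k = sigma_{k-1} * zeta_k,
    zeta_k i.i.d. N(0, 1), and an adapted (ARCH-flavoured) volatility
    sigma_{k-1} = sqrt(0.1 + 0.5 * X_{k-1}^2).  Given the past,
    eps_k / sigma_{k-1} is N(0, 1), so condition (2) holds for any mu < 1/2,
    while eps_k itself need not be subgaussian."""
    rng = random.Random(seed)
    X = [0.0]                 # X_0, ..., X_n (autoregression: X_k = Y_k)
    sigma, Y = [], []
    for _ in range(n):
        s = math.sqrt(0.1 + 0.5 * X[-1] ** 2)   # sigma_{k-1}: F_{k-1}-measurable
        y = f(X[-1]) + s * rng.gauss(0.0, 1.0)  # Y_k
        sigma.append(s)
        Y.append(y)
        X.append(y)
    return X, Y, sigma
```

This is the autoregression particular case of the next slide, with $v = \sigma$ observed.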
4 A model: regression with martingale increment errors. Particular cases

Very particular cases: the usual i.i.d. regression and autoregression models.

Regression model: we observe $(Y_k, X_{k-1})_{k=1}^N$ satisfying
$$Y_k = f(X_{k-1}) + v(X_{k-1})\, \zeta_k,$$
where $(\zeta_k)$ is i.i.d., centered, subgaussian and independent of $\mathcal F_k = \sigma(X_0, \ldots, X_k)$.

Autoregression model: we observe $(X_k)_{k=0}^N$ satisfying
$$X_k = f(X_{k-1}) + v(X_{k-1})\, \zeta_k,$$
where $(\zeta_k)$ is i.i.d., centered, independent of $X_0$ and subgaussian.
5 A model: regression with martingale increment errors. Remarks

Remark 1 (stopping time). We observe $(X_{k-1}, Y_k)_{k=1}^N$, where $N$ is a stopping time:
- non-asymptotic results
- online setting: the statistician decides to stop the sampling according to some rule (clinical trials, ...)

Remark 2 (variance). The variance process $(\sigma_k)$ is observed. We can use a two-step Lepski procedure to estimate the conditional variance. While $\varepsilon_k / \sigma_{k-1}$ is conditionally subgaussian, $\varepsilon_k$ is not in general! Think of a GARCH model $(\varepsilon_k, \sigma_k)$ for instance: even if $(\zeta_k)$ is Gaussian i.i.d., $\varepsilon_k = \sigma_{k-1} \zeta_k$ can have heavy tails.
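The GARCH remark is easy to check by simulation: with ARCH(1) volatility and i.i.d. Gaussian innovations, $\varepsilon_k / \sigma_{k-1}$ is exactly standard Gaussian, while the marginal of $\varepsilon_k$ is much heavier-tailed. A pure-Python sketch, with illustrative parameters:

```python
import math
import random

def kurtosis(xs):
    """Sample kurtosis (Gaussian value is 3)."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * v * v)

def arch_tails(n=50000, a0=0.1, a1=0.9, seed=42):
    """eps_k = sigma_{k-1} * zeta_k with sigma_{k-1}^2 = a0 + a1 * eps_{k-1}^2
    (an ARCH(1) recursion).  The Gaussian draws zeta_k are returned too, for
    comparison: eps_k / sigma_{k-1} = zeta_k is exactly N(0, 1), yet the
    adapted volatility makes eps_k itself heavy-tailed."""
    rng = random.Random(seed)
    eps, zeta, e_prev = [], [], 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        s = math.sqrt(a0 + a1 * e_prev ** 2)
        e_prev = s * z
        eps.append(e_prev)
        zeta.append(z)
    return kurtosis(eps), kurtosis(zeta)
```

With $a_1 = 0.9$ the marginal fourth moment of $\varepsilon$ is infinite (since $3 a_1^2 > 1$), so the sample kurtosis of $\varepsilon$ blows up while that of $\zeta$ stays near the Gaussian value 3.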
6 The Lepski method: some preliminary definitions

Important object: for $h > 0$, take
$$L(h) = \sum_{k=1}^{N} \frac{1}{\sigma_{k-1}^2}\, \mathbf 1_{|X_{k-1} - x| \le h}$$
= the occupation time of $(X_k)_{k \ge 0}$ at $x$, renormalized by $(\sigma_k)$.

If $L(h) > 0$, define the kernel estimator
$$\hat f(h) = \frac{1}{L(h)} \sum_{k=1}^{N} \frac{1}{\sigma_{k-1}^2}\, \mathbf 1_{|X_{k-1} - x| \le h}\, Y_k.$$

Consider the set of bandwidths
$$\mathcal H := \{h_j : L(h_j) > 0\}, \quad \text{where } h_j = h_0 q^j,$$
for some parameters $h_0 > 0$ and $q \in (0, 1)$ [other choices are possible].
7 The Lepski method: definition

Define, for some $b > 0$:
$$\psi(h) := 1 + b \log(h_0 / h).$$
For $u > 0$, define on $\{L(h_0)^{-1/2} \le u\}$:
$$H_u := \min\Big\{h \in \mathcal H : \Big(\frac{\psi(h)}{L(h)}\Big)^{1/2} \le u\Big\},$$
and let $u_0 > 0$. On $\{L(h_0)^{-1/2} \le u_0\}$, we select $\hat H$ according to the following standard Lepski rule (Lepski (1992), ...):
$$\hat H := \max\Big\{h \in \mathcal H : h \ge H_{u_0} \text{ and } \forall h' \in [H_{u_0}, h] \cap \mathcal H,\ |\hat f(h) - \hat f(h')| \le \nu \Big(\frac{\psi(h')}{L(h')}\Big)^{1/2}\Big\},$$
where $\nu > 0$.
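The estimator and the selection rule above can be sketched in a few lines of pure Python. The rectangular kernel is the one used in the talk; the values of $h_0$, $q$, $b$, $\nu$, $u_0$ are illustrative tuning choices, with no claim that they match the theory's constants:

```python
import math

def lepski_fit(X, Y, sigma, x, h0=1.0, q=0.5, b=1.0, nu=2.0, u0=1.0, jmax=25):
    """Kernel estimator + Lepski bandwidth selection (a sketch).
    Inputs: X = [X_0..X_{N-1}], Y = [Y_1..Y_N], sigma = [sigma_0..sigma_{N-1}]."""
    def L(h):     # renormalized occupation time L(h)
        return sum(1.0 / sigma[i] ** 2 for i in range(len(X)) if abs(X[i] - x) <= h)

    def fhat(h):  # kernel estimator fhat(h), rectangular kernel
        return sum(Y[i] / sigma[i] ** 2 for i in range(len(X))
                   if abs(X[i] - x) <= h) / L(h)

    def crit(h):  # sqrt(psi(h) / L(h)) with psi(h) = 1 + b * log(h0 / h)
        return math.sqrt((1.0 + b * math.log(h0 / h)) / L(h))

    grid = [h0 * q ** j for j in range(jmax + 1) if L(h0 * q ** j) > 0]
    admissible = [h for h in grid if crit(h) <= u0]
    if not admissible:
        return None
    Hu0 = min(admissible)
    # Hhat = largest h >= H_{u0} such that fhat(h) stays within
    # nu * sqrt(psi(h') / L(h')) of fhat(h') for every h' in [H_{u0}, h]
    Hhat = Hu0
    for h in sorted(hh for hh in grid if hh >= Hu0):
        if all(abs(fhat(h) - fhat(hp)) <= nu * crit(hp)
               for hp in grid if Hu0 <= hp <= h):
            Hhat = h
    return Hhat, fhat(Hhat)
```

On noiseless constant data the bias term vanishes, so the rule picks the largest bandwidth of the grid, as it should.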
8 An adaptive upper bound: preliminaries

Remark. $u_0 > 0$ is such that $L(h_0)^{-1/2} \le u_0$. This is very mild: if there is some data close to $x$, and if $h_0$ is large enough, then $L(h_0)$ should be larger than some constant. [We don't have ergodicity, so we don't know if $L(h)$ is close to $\mathbb E L(h) \approx n\, \mathbb P_X([x - h, x + h])$...]

We want to prove an oracle result: we prove that $\hat f(\hat H)$ has the same rate of convergence as $\hat f(H^*)$, where $H^*$ is the oracle bandwidth.

Consider $W$ (bias condition) such that
$$\sup_{h' \in [H_{u_0}, h] \cap \mathcal H} \Bigg| \frac{\sum_{i=1}^N f(X_{i-1})\, \sigma_{i-1}^{-2}\, \mathbf 1_{|X_{i-1} - x| \le h'}}{\sum_{i=1}^N \sigma_{i-1}^{-2}\, \mathbf 1_{|X_{i-1} - x| \le h'}} - f(x) \Bigg| \le W(h)$$
(OK if $f$ is Hölder, for instance...). Nothing is required on $W$, but we need to bound it from below and above:
$$\overline W(h) := \big(\varepsilon_0 (h / h_0)^{\alpha_0} \vee W(h)\big) \wedge u_0,$$
where $\varepsilon_0, \alpha_0 > 0$.
9 An adaptive upper bound

On the event $\{L(h_0)^{-1/2} \le \overline W(h_0)\}$, define the oracle bandwidth
$$H^* := \min\Big\{h \in \mathcal H : \Big(\frac{\psi(h)}{L(h)}\Big)^{1/2} \le \overline W(h)\Big\}.$$
Denote $\mathbb P^*(A) = \mathbb P(A \cap \Omega^*)$, where
$$\Omega^* := \big\{L(h_0)^{-1/2} \le \overline W(h_0),\ \overline W(H^*) \le u_0\big\}.$$
$\Omega^*$ is a minimal technical requirement for the proof of the adaptive upper bound.

Theorem (Adaptive upper bound). Grant (2) ($\varepsilon_k / \sigma_{k-1}$ is subgaussian conditionally on $\mathcal F_{k-1}$) and let $\hat f(\hat H)$ be the procedure given by the Lepski rule. Then we have, for any $t > 0$:
$$\mathbb P^*\big[\overline W(H^*)^{-1}\, |\hat f(\hat H) - f(x)| > t\big] \le c\, (1 + \log(1 + t))\, t^{-b \lambda \nu^2 / (33 \alpha_0)},$$
where $c$ is a constant depending on $\lambda, \mu, \gamma, q, b, u_0, \varepsilon_0, \alpha_0, \nu$.
10 An adaptive upper bound: remarks

- We don't need stationarity or ergodicity for the proof. BUT, without further assumptions, we cannot describe the behaviour of $H^*$ (and of $\overline W(H^*)$): it does not necessarily go to 0, for instance when $(X_k)$ is transient. This is why we needed to introduce $\overline W$.
- The cornerstone of the proof is a new result concerning the stability of self-normalized martingales.
- The only true assumptions are the martingale difference structure of $(\varepsilon_k)$ and the conditional subgaussianity of $\varepsilon_k / \sigma_{k-1}$ given $\mathcal F_{k-1}$. Indeed, we also require $L(h_0)^{-1/2} \le \overline W(h_0)$ and $\overline W(H^*) \le u_0$, but this is no longer a restriction when $f$ is Hölder, for instance, since in this case $W(h) = L h^s$.
- We use kernel estimation with $K(x) = \mathbf 1_{[-1,1]}(x) / 2$ for technical simplicity, so we obtain adaptation only for Hölder exponents $s \le 1$. One can consider the Lepski method applied to local polynomials instead.
11 A tool: stability for self-normalized martingales

Let $(M_n)_{n \ge 0}$ be a locally square integrable $(\mathcal G_n)$-martingale. The predictable quadratic variation of $M_n$ is
$$\langle M \rangle_n := \sum_{k=1}^n \mathbb E(\Delta M_k^2 \mid \mathcal G_{k-1}), \quad \text{where } \Delta M_n := M_n - M_{n-1}.$$
A standard concentration inequality is Freedman's inequality: if $(M_n)_{n \ge 0}$ is such that $|\Delta M_n| \le c$ a.s., then for any $x, y > 0$:
$$\mathbb P\big[M_n \ge x,\ \langle M \rangle_n \le y\big] \le \exp\Big(-\frac{x^2}{2(y + cx)}\Big).$$
One can use Bernstein's condition instead of $|\Delta M_k| \le c$:
$$\sum_{k=1}^n \mathbb E\big[|\Delta M_k|^p \mid \mathcal G_{k-1}\big] \le \frac{p!}{2}\, c^{p-2}\, \langle M \rangle_n, \quad p \ge 2;$$
see Pinelis (1994), de la Peña (1999).
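Freedman's inequality can be sanity-checked by Monte Carlo on the simplest bounded-increment martingale, a symmetric ±1 random walk, for which $c = 1$ and $\langle M \rangle_n = n$ deterministically. The sample size, the level $x$ and the number of trials are arbitrary choices:

```python
import math
import random

def freedman_check(n=100, x=20.0, trials=20000, seed=7):
    """Monte Carlo check of Freedman's inequality: M_n = sum of n i.i.d.
    +/-1 signs, so |increments| <= c = 1 and <M>_n = n.  Compare the
    empirical frequency of {M_n >= x} with the Freedman bound at y = n,
    namely exp(-x^2 / (2 (n + c x)))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        M = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
        if M >= x:
            hits += 1
    empirical = hits / trials
    bound = math.exp(-x ** 2 / (2.0 * (n + x)))
    return empirical, bound
```

Here the bound is not tight (it is roughly $0.19$ while the true probability is near $0.03$), but it holds without any distributional assumption beyond the bounded increments.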
12 A tool: stability for self-normalized martingales

Problem: a Freedman-type inequality (or Bernstein's inequality) is not enough to prove the adaptive upper bound: in Freedman's inequality, we must work on $\{\langle M \rangle_n \le y\}$, which requires an ad-hoc assumption on $(X_k)$, such as some mixing property, precisely what we wanted to avoid.

First idea: give a deviation bound for $M_T / \langle M \rangle_T^{1/2}$. But it is well known that, in general, $M_T / \langle M \rangle_T^{1/2}$ is not even tight! [Simple counter-example: consider $M_n = B_n$, where $(B_t)$ is a standard Brownian motion, and define $T_c = \inf\{n \ge 1 : B_n / \sqrt n \ge c\}$ for $c > 0$. For any $c > 0$, $T_c$ is finite a.s., so one has $M_{T_c} / \langle M \rangle_{T_c}^{1/2} = M_{T_c} / \sqrt{T_c} \ge c$, for any $c > 0$!]

A simple solution: replace $M_T / \langle M \rangle_T^{1/2}$ by $\sqrt a\, M_T / (a + \langle M \rangle_T)$. We prove that, for any $a > 0$, $\sqrt a\, M_T / (a + \langle M \rangle_T)$ is subgaussian when $\Delta M_n$ is subgaussian, hence the name "stability".
13 A tool: stability for self-normalized martingales

Theorem (stability). Assume that $\Delta M_n = s_{n-1} \zeta_n$, where $(\zeta_n)$ is a sequence of $(\mathcal G_n)$-martingale increments such that, for some $\mu > 0$,
$$\mathbb E\big[\exp(\mu \zeta_n^2) \mid \mathcal G_{n-1}\big] \le \gamma \quad \text{for any } n \ge 1,$$
and where $(s_n)_{n \in \mathbb N}$ is a $(\mathcal G_n)$-adapted sequence of non-negative random variables. Define
$$V_n := \sum_{k=1}^n s_{k-1}^2.$$
Then, for any $\lambda \in \big[0, \frac{\mu}{2(1+\gamma)}\big)$, any $a > 0$ and any stopping time $T$, we have
$$\mathbb E\Big[\exp\Big(\lambda\, \frac{a M_T^2}{(a + V_T)^2}\Big)\Big] \le 1 + c_\lambda, \qquad (3)$$
with $c_\lambda := \exp\big(\frac{\lambda \Gamma_\lambda}{2(1 - 2 \lambda \Gamma_\lambda)}\big)\big(\exp(\lambda \Gamma_\lambda) - 1\big)$ and $\Gamma_\lambda := \frac{1 + 2\gamma}{2(\mu - \lambda)}$.
14 A tool: stability for self-normalized martingales

- A similar result holds for sub-exponential martingale increments.
- This theorem shows that when $\zeta_k$ is subgaussian (resp. sub-exponential) conditionally on $\mathcal G_{k-1}$, then $\sqrt a\, M_T / (a + V_T)$ is also subgaussian (resp. sub-exponential), hence the name "stability".
- There is no concentration-of-measure phenomenon here: we only prove that the tails of $\sqrt a\, M_T / (a + V_T)$ are of the same order as those of $\zeta_n$ (indeed, $M_n$ can be equal to $\zeta_n$: take $s_{n-1} = 1$ and $s_k = 0$ for $k \ne n - 1$).
- It is tempting to take $a = V_T$ to minimize the exponential moment, but this is not possible! (Again, $M_T / \langle M \rangle_T^{1/2}$ is not even tight in general.)
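Both phenomena (the raw self-normalized ratio forced above any level $c$ at a well-chosen stopping time, and the stabilized ratio staying of order one) can be checked by simulation. In this sketch the Brownian motion of the counter-example is sampled at integer times, which gives a standard Gaussian random walk; the level $c$, the constant $a$ and the horizon are illustrative choices:

```python
import math
import random

def stopped_ratios(c, seed, a=1.0, nmax=200000):
    """Gaussian random walk M_n (= B_n at integer times), so <M>_n = V_n = n.
    Stop at T_c = inf{n >= 1 : M_n / sqrt(n) >= c}.  Return the raw
    self-normalized ratio M_T / sqrt(V_T), which is >= c by construction,
    and the stabilized ratio sqrt(a) * M_T / (a + V_T), which the stability
    theorem keeps subgaussian uniformly over stopping times."""
    rng = random.Random(seed)
    M = 0.0
    for n in range(1, nmax + 1):
        M += rng.gauss(0.0, 1.0)
        if M / math.sqrt(n) >= c:
            return M / math.sqrt(n), math.sqrt(a) * M / (a + n)
    return None  # T_c not reached within nmax steps (it is finite a.s.)
```

By construction the first returned ratio is at least $c$, whatever $c$ is, while the AM-GM inequality $\sqrt{a n} \le (a + n)/2$ forces the stabilized ratio to be at most half the raw one.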
15 A link with the usual minimax theory: some preliminaries

Aim: under a dependence assumption on $(X_k)$, find a deterministic equivalent to our random rate of convergence. This makes our theory consistent with the usual minimax theory of deterministic rates.

To measure the dependence of $(X_k)_{k \ge 0}$, we can use β-mixing (see Kolmogorov and Rozanov (1960), and see Doukhan (1994) for topics on dependence). Introduce $\mathcal X_u^v := \sigma(X_k : u \le k \le v)$ for $u \le v$. We say that a strictly stationary sequence $(X_k)_{k \in \mathbb Z}$ is β-mixing if
$$\beta_q := \frac{1}{2} \sup \sum_{i=1}^I \sum_{j=1}^J \big| \mathbb P[U_i \cap V_j] - \mathbb P[U_i]\, \mathbb P[V_j] \big| \to 0 \quad \text{as } q \to +\infty,$$
where the sup is taken over all finite partitions $(U_i)_{i=1}^I$ and $(V_j)_{j=1}^J$ that are respectively $\mathcal X_{-\infty}^0$- and $\mathcal X_q^{+\infty}$-measurable.

β-mixing has been used a lot (in statistics) mainly because of a coupling result by Berbee, see Berbee (1979), that allows one to recover independence on blocks, on which one can use Bernstein's or Talagrand's inequality.
16 A link with the usual minimax theory: some preliminaries

Assumptions:
- The sequence $(\sigma_k)_{k \ge 0}$ is equal to a known constant $\sigma$.
- $(X_k)_{k \ge 0}$ is a strictly stationary process.
- We observe $(X_{k-1}, Y_k)_{k=1}^n$ (stopping time $N \equiv n$).
- $f$ has Hölder-type smoothness in a neighbourhood of $x$.

Let us fix two constants $\kappa, \tau > 0$.

Assumption (Smoothness of $f$). There are $\eta > 0$, $0 < s \le 1$ and a slowly varying function $l_w$ such that
$$\sup_{y : |y - x| \le h} |f(y) - f(x)| \le w(h), \quad \text{where } w(h) := h^s\, l_w(h),$$
for any $h \le \eta$, and such that $w$ is increasing on $[0, \eta]$, with $w(h) \ge \tau h^2$ and $w(h) \le \kappa$ for any $h \in [0, \eta]$.

This is slightly more general than Hölder smoothness, which corresponds to $l_w \equiv r$ with $r > 0$.
17 A link with the usual minimax theory: some preliminaries

Under these assumptions, we can consider
$$H_w := \min\Big\{h > 0 : \Big(\frac{\psi(h)}{L(h)}\Big)^{1/2} \le w(h)\Big\}$$
= the optimal bandwidth associated with the modulus of continuity $w$.

We can use the adaptive upper bound with $W(h) = w(h)$; the random rate is then $w(H_w)$ for any $f$ with modulus of continuity $w$.

Idea: under a β-mixing assumption, we can say how $L(h)$ concentrates around its expectation $\mathbb E L(h)$ (Bernstein's inequality), so a natural deterministic equivalent of $H_w$ is
$$h_w := \min\Big\{h > 0 : \Big(\frac{\psi(h)}{\mathbb E L(h)}\Big)^{1/2} \le w(h)\Big\}.$$
18 A link with the usual minimax theory: some preliminaries

It is easy to give the behaviour of $h_w$ under the following assumption on $\mathbb P_X$ (using properties of regularly varying functions):

Assumption (Local behaviour of $\mathbb P_X$). There are $\eta > 0$ and $\gamma \ge -1$ such that for any $h \in [0, \eta]$ we have
$$\mathbb P_X([x - h, x + h]) = h^{\gamma + 1}\, l_X(h),$$
where $l_X$ is slowly varying.

This is an extension of the case where $\mathbb P_X$ has a continuous density $f_X$ w.r.t. the Lebesgue measure with $f_X(x) > 0$. It is met, for instance, when $f_X(y) = c\, |y - x|^\gamma$ for $y$ close to $x$.

Lemma. For $n$ large enough, we can write
$$h_w = n^{-1/(2s + \gamma + 1)}\, l_1(1/n) \quad \text{and} \quad w(h_w) = n^{-s/(2s + \gamma + 1)}\, l_2(1/n),$$
where $l_1$ and $l_2$ are slowly varying functions.
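The lemma can be checked numerically in the simplest special case: take $\sigma \equiv 1$, $w(h) = h^s$ and $\mathbb E L(h) = n\, h^{\gamma + 1}$ (all slowly varying factors set to 1). The sketch below solves the defining inequality of $h_w$ by bisection on a log scale; the parameter values are illustrative:

```python
import math

def h_w_numeric(n, s=0.5, gamma=0.0, b=1.0, h0=1.0):
    """Solve numerically for h_w = min{h : sqrt(psi(h) / E L(h)) <= w(h)},
    in the illustrative special case sigma = 1, w(h) = h^s and
    E L(h) = n * h^(gamma + 1) (slowly varying factors set to 1)."""
    def small_enough(h):
        psi = 1.0 + b * math.log(h0 / h)
        # the criterion sqrt(psi(h) / (n h^{gamma+1})) is decreasing in h,
        # while w(h) = h^s is increasing, so bisection applies
        return math.sqrt(psi / (n * h ** (gamma + 1.0))) <= h ** s
    lo, hi = 1e-12, h0
    for _ in range(200):
        mid = math.sqrt(lo * hi)   # bisection on a log scale
        if small_enough(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

For $s = 1/2$ and $\gamma = 0$ the lemma predicts $h_w \approx n^{-1/2}$ up to a slowly varying (here logarithmic) factor: the ratio $h_w / n^{-1/2}$ returned by the bisection grows only like $\sqrt{\log n}$.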
19 A link with the usual minimax theory

Proposition (deterministic equivalent). Grant the previous assumptions on $f$ and $\mathbb P_X$. If $(X_k)$ is geometrically β-mixing, namely if there are $b, c > 0$ such that for any $q \ge 1$:
$$\beta_q \le \exp\big(-(q/b)^{1/c}\big),$$
then, for $n$ large enough, we have
$$\mathbb P\Big[\frac{w(h_w)}{4} \le w(H_w) \le 4\, w(h_w)\Big] \ge 1 - \exp\big(-C_1 n^{\delta} l_2(1/n)\big), \quad \text{where } \delta = \frac{2s}{(2s + \gamma + 1)(c + 1)},$$
where $C_1 > 0$ is a constant and $l_2$ is a slowly varying function, depending on $b, c, \gamma, s, \sigma$ and $l_X, l_w$.

A similar result holds for arithmetic mixing, when $\beta_q \le (b/q)^{1/c}$ with $c < 2s / (\gamma + 1)$.
20 A link with the usual minimax theory

In the following simple nonparametric regression setting:
- $f$ is $s$-Hölder ($w(h) = L h^s$, so $l_w \equiv L$),
- $\mathbb P_X$ has a density $f_X$ which is finite and bounded away from zero at $x$ ($\gamma = 0$),
- $(X_k)_{k \ge 0}$ is geometrically β-mixing, or arithmetically β-mixing with $\beta_q \le (b/q)^{1/c}$ for $c < 2s$,

then $w(h_w)$ has the same order as $[(\log n)/n]^{s/(2s+1)}$ with large probability. Note that $[(\log n)/n]^{s/(2s+1)}$ is the pointwise adaptive minimax rate in this case. Hence the main theorem is consistent with the usual minimax theory of deterministic rates in ergodic situations.
21 Conclusion

The message of this work is twofold:
- The kernel estimator and Lepski's method are very robust to the statistical properties of the model: no need for ergodicity to be almost optimal.
- For the theoretical assessment of an estimator, one can develop a theory involving random rates that depend on the occupation time. The rate would be almost observable if the smoothness of $f$ were known... (confidence bands?). Ergodicity should only be used in a second step of this theory, to derive the asymptotic behaviour of the random rate.

Perspectives:
- Open question: is $\overline W(H^*)$ optimal?
- Applications in econometrics.
- Learning theory beyond the usual i.i.d. regression setting.
22 Bibliography

- Berbee, H. C. P. (1979). Random walks with stationary increments and renewal theory. Vol. 112 of Mathematical Centre Tracts. Mathematisch Centrum, Amsterdam.
- de la Peña, V. H. (1999). A general class of exponential inequalities for martingales and ratios. Ann. Probab.
- Doukhan, P. (1994). Mixing: Properties and examples. Vol. 85 of Lecture Notes in Statistics. Springer-Verlag, New York.
- Kolmogorov, A. N. and Rozanov, J. A. (1960). On a strong mixing condition for stationary Gaussian processes. Teor. Verojatnost. i Primenen.
- Lepski, O. V. (1992). On problems of adaptive estimation in white Gaussian noise. Advances in Soviet Mathematics.
- Pinelis, I. (1994). Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probab.
More information5th-order differentiation
ARBITRARY-ORDER REAL-TIME EXACT ROBUST DIFFERENTIATION A. Levant Applied Mathematics Dept., Tel-Aviv University, Israel E-mail: levant@post.tau.ac.il Homepage: http://www.tau.ac.il/~levant/ 5th-order differentiation
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationA generalization of Strassen s functional LIL
A generalization of Strassen s functional LIL Uwe Einmahl Departement Wiskunde Vrije Universiteit Brussel Pleinlaan 2 B-1050 Brussel, Belgium E-mail: ueinmahl@vub.ac.be Abstract Let X 1, X 2,... be a sequence
More informationStatistical Measures of Uncertainty in Inverse Problems
Statistical Measures of Uncertainty in Inverse Problems Workshop on Uncertainty in Inverse Problems Institute for Mathematics and Its Applications Minneapolis, MN 19-26 April 2002 P.B. Stark Department
More informationTime Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY
Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference
More informationTime Series and Forecasting Lecture 4 NonLinear Time Series
Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationµ X (A) = P ( X 1 (A) )
1 STOCHASTIC PROCESSES This appendix provides a very basic introduction to the language of probability theory and stochastic processes. We assume the reader is familiar with the general measure and integration
More information1 Class Organization. 2 Introduction
Time Series Analysis, Lecture 1, 2018 1 1 Class Organization Course Description Prerequisite Homework and Grading Readings and Lecture Notes Course Website: http://www.nanlifinance.org/teaching.html wechat
More informationRegular Variation and Extreme Events for Stochastic Processes
1 Regular Variation and Extreme Events for Stochastic Processes FILIP LINDSKOG Royal Institute of Technology, Stockholm 2005 based on joint work with Henrik Hult www.math.kth.se/ lindskog 2 Extremes for
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationNONPARAMETRIC DENSITY ESTIMATION WITH RESPECT TO THE LINEX LOSS FUNCTION
NONPARAMETRIC DENSITY ESTIMATION WITH RESPECT TO THE LINEX LOSS FUNCTION R. HASHEMI, S. REZAEI AND L. AMIRI Department of Statistics, Faculty of Science, Razi University, 67149, Kermanshah, Iran. ABSTRACT
More informationExponential tail inequalities for eigenvalues of random matrices
Exponential tail inequalities for eigenvalues of random matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify
More informationOn detection of unit roots generalizing the classic Dickey-Fuller approach
On detection of unit roots generalizing the classic Dickey-Fuller approach A. Steland Ruhr-Universität Bochum Fakultät für Mathematik Building NA 3/71 D-4478 Bochum, Germany February 18, 25 1 Abstract
More informationDISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition
DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition R. G. Gallager January 31, 2011 i ii Preface These notes are a draft of a major rewrite of a text [9] of the same name. The notes and the text are outgrowths
More informationConvergence at first and second order of some approximations of stochastic integrals
Convergence at first and second order of some approximations of stochastic integrals Bérard Bergery Blandine, Vallois Pierre IECN, Nancy-Université, CNRS, INRIA, Boulevard des Aiguillettes B.P. 239 F-5456
More informationSupplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017
Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION By Degui Li, Peter C. B. Phillips, and Jiti Gao September 017 COWLES FOUNDATION DISCUSSION PAPER NO.
More informationFast-slow systems with chaotic noise
Fast-slow systems with chaotic noise David Kelly Ian Melbourne Courant Institute New York University New York NY www.dtbkelly.com May 1, 216 Statistical properties of dynamical systems, ESI Vienna. David
More informationI forgot to mention last time: in the Ito formula for two standard processes, putting
I forgot to mention last time: in the Ito formula for two standard processes, putting dx t = a t dt + b t db t dy t = α t dt + β t db t, and taking f(x, y = xy, one has f x = y, f y = x, and f xx = f yy
More informationStatistical Properties of Numerical Derivatives
Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective
More informationLARGE DEVIATION PROBABILITIES FOR SUMS OF HEAVY-TAILED DEPENDENT RANDOM VECTORS*
LARGE EVIATION PROBABILITIES FOR SUMS OF HEAVY-TAILE EPENENT RANOM VECTORS* Adam Jakubowski Alexander V. Nagaev Alexander Zaigraev Nicholas Copernicus University Faculty of Mathematics and Computer Science
More informationNotes 1 : Measure-theoretic foundations I
Notes 1 : Measure-theoretic foundations I Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Wil91, Section 1.0-1.8, 2.1-2.3, 3.1-3.11], [Fel68, Sections 7.2, 8.1, 9.6], [Dur10,
More informationMod-φ convergence I: examples and probabilistic estimates
Mod-φ convergence I: examples and probabilistic estimates Valentin Féray (joint work with Pierre-Loïc Méliot and Ashkan Nikeghbali) Institut für Mathematik, Universität Zürich Summer school in Villa Volpi,
More informationExtreme Value Analysis and Spatial Extremes
Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models
More informationAn Introduction to Probability Theory and Its Applications
An Introduction to Probability Theory and Its Applications WILLIAM FELLER (1906-1970) Eugene Higgins Professor of Mathematics Princeton University VOLUME II SECOND EDITION JOHN WILEY & SONS Contents I
More informationSTA205 Probability: Week 8 R. Wolpert
INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and
More informationKernels to detect abrupt changes in time series
1 UMR 8524 CNRS - Université Lille 1 2 Modal INRIA team-project 3 SSB group Paris joint work with S. Arlot, Z. Harchaoui, G. Rigaill, and G. Marot Computational and statistical trade-offs in learning IHES
More information