
Simultaneous White Noise Models and Optimal Recovery of Functional Data

by

Mark Koudstaal

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Statistics
University of Toronto

© Copyright 2015 by Mark Koudstaal

Abstract

Simultaneous White Noise Models and Optimal Recovery of Functional Data
Mark Koudstaal
Doctor of Philosophy
Graduate Department of Statistics
University of Toronto
2015

We consider i.i.d. realizations of a Gaussian process on [0, 1] satisfying prescribed regularity conditions. The data consist of discrete samplings of these realizations in i.i.d. Gaussian noise and the goal is estimation of the underlying trajectories. Further, we want our estimates to enjoy expected L^2 errors, conditioned on the realized trajectories, which attain optimal rates. Under general conditions on both design and process, an asymptotic equivalence, in Le Cam's sense, is established between an experiment which simultaneously describes these realizations and a collection of white noise models. The risk properties of our estimation goal may then be studied in an idealized setting and benchmarks established for practically implementable procedures. In this context, the white noise models are projected onto a basis satisfying general conditions in relation to the covariance kernel of the process which generated the data. This reduces the problem of initial interest to that of recovering a collection of normal means in Euclidean norm, the means of interest having Gaussian structure. A variant of Stein estimation is applied for recovery of these means and a key inequality is derived showing that the corresponding risks, conditioned on the underlying means, can be made arbitrarily close to those that an oracle with knowledge of the process would attain. This establishes various notions of optimality for our recovery procedure.

Finally, guarantees are derived for practically implementable variants and empirical performance is illustrated through simulated and real data examples.

Acknowledgements

This thesis owes its existence to a fantastic network of friends and teachers. I'm incredibly grateful to have had such a thoughtful and supportive advisor in Fang, who devoted a huge amount of time to the development of this work with patient, enthusiastic and encouraging scientific guidance. He has been generous with and confident in me from the outset and I will always value the gift of being given the confidence to take this project in directions that caught my interest. My heartfelt thanks go to my advisory committee for taking the time to listen to rough progress reports on this work, for setting me at ease in committee meetings with kind, engaged, thoughtful listening and for encouraging suggestions that have improved and diversified the scope of things. Special thanks to Andrey for generously nurturing my interest in wavelets! Words can't express my thanks for my parents. They've made everything possible, been fantastic friends along the way and I'm lucky to have them in my life. A champion's medal goes to Ali for sharing in the ups and downs of the whole process and the uncertainty of it all. I'm so lucky to have such a great friend in my partner.

Contents

1 Introduction
2 White Noise Equivalence for Functional Data
   Le Cam Equivalence
   White Noise Equivalence for Nonparametric Data
   Functional Data and White Noise Representations
3 Stein Estimation and Oracle Inequalities
4 Minimax Recovery of Functional Data
5 Implementable Procedures with Rigorous Guarantees
   Estimation of σ From Quantiles
   Extension to Imperfect Data Models
   Guarantees for Recovery on Equispaced Design
   Suggested Directions for Random Design
6 Numerical Experiments
   Simulation studies
   Recovery of Normalized Stock Prices
7 Appendix

Bibliography

List of Tables

6.1 In Sample Recovery. Results are blown up by a factor of 10^3 to preserve space.
6.2 Out of Sample Recovery. Results are blown up by a factor of 10^3 to preserve space.
6.3 Errors calculated on training data. Wavelet and Fourier estimates are those corresponding to the minimum quantile estimates. Results are blown up by a factor of 10^4 to preserve space.

List of Figures

6.1 This plot compares the oracle threshold weights λ_k/(λ_k + τ^2) (blue), for covariance function C_4, with the empirically estimated threshold weights α_{mn,k} for σ (red) and σ̂ (green). As can be seen from the graph, both sets of empirical weights are more conservative than the oracle. This results directly from the fact that the inflation factors (1 + Cδ) used in the α_{mn,k} guarantee that, with high probability, α_{mn,k} ≤ λ_k/(λ_k + τ^2) simultaneously for k = 1, ..., m.
6.2 The blue line plots recovery error for the estimate f̂(α), which uses σ̂^2(α) to construct the α_{mn,k}, against α. The red line is the error corresponding to the estimate f̂(1/m), which lies below the entire error curve, suggesting that this value is optimal.
6.3 A sample function plotted alongside the corresponding oracle and σ̂^2(1/m) estimates in wavelet bases.
6.4 A sample function plotted alongside the corresponding oracle and σ̂^2(1/m) estimates in wavelet bases.

Chapter 1

Introduction

Functional Data Analysis (FDA) is maturing to encompass a host of techniques and applications in nonparametric statistics. The monographs of Ramsay and Silverman [35, 36] provide overviews and numerous examples. We begin with a setup drawn from FDA: we assume we have a collection of functions f_1, ..., f_n, i.i.d. realizations of f, a mean 0 Gaussian distribution over a sufficiently regular function class F ⊂ L^2[0, 1] satisfying E||f||^2 < ∞. The f_i are observed intermittently and with noise, so that our data consist of

    y_ij = f_i(x_ij) + z_ij,   z_ij i.i.d. N(0, 1),   for i = 1, ..., n, j = 1, ..., m,   (1.0.1)

for design points x_ij ∈ [0, 1]. Our interest is in modelling the risks of recovering the f_i from (1.0.1). The covariance kernel, C(s, t) = E f(s)f(t), induces an integral operator C : L^2[0, 1] → L^2[0, 1] through the action

    Cg = ∫_0^1 C(·, s) g(s) ds

and the corresponding orthonormal eigenbasis, {ψ_k}_{k=1}^∞, ordered with decreasing eigenvalues, is known as the Karhunen-Loève (KL) basis. This basis is known to be optimal for linear recovery in that the expected L^2[0, 1] approximation errors,

    E(N) = E||f − Σ_{p=1}^N ⟨f, ψ_p⟩ ψ_p||^2 = Σ_{p>N} ⟨Cψ_p, ψ_p⟩,

are smaller in the KL basis, for every N, than in any other orthonormal basis. The bulk of related literature focuses on using (1.0.1) to estimate the eigenstructure of C and the prevailing philosophy is that recovery using quality estimates of this basis will be optimal. Additionally, other procedures of interest to FDA are often performed optimally in this basis. Under the assumption of densely sampled data, a common approach is to pre-smooth curves and use the resulting estimates to approximate the covariance structure. This can then be used for expansion of the underlying process. This type of approach has been analyzed in Cardot [12] under the assumptions of a finite dimensional underlying process and equispaced sampling. The PACE procedure, put forward in Yao et al. [45], introduced a clever smoothing procedure for estimation of the eigenstructure of C and proposed a novel conditional expectation approach for recovery of the f_i in the estimated basis. Further, this procedure enjoys robustness against sparse and irregularly sampled data common to longitudinal studies. Smoothness and the operation of sampling are naturally expressed in the language of Reproducing Kernel Hilbert Space (RKHS) and this structure has recently been exploited to develop unique procedures relevant to our problem. In Cai and Yuan [7] a smoothing spline approach is taken for recovery of C from (1.0.1), under relaxed assumptions on the design points, and the procedure is shown to come within logarithmic factors of optimality under a matching of the smoothness of the process to the kernel used for recovery. With a finite dimensional assumption on the

underlying process, Amini and Wainwright [1] develop a novel method for extending recovery of C to noisy sampling schemes in a fairly general class of domains and prove optimality properties. Nevertheless, the primary focus of these techniques is recovery of C. Further, beyond the finite dimensional setting and the notion of consistency, their performance and properties regarding recovery of the f_i in more general settings remain unclear. Techniques from nonparametric function estimation might be brought to bear, but treating (1.0.1) as a collection of separate recovery problems would neglect potentially useful information from the distributional structure of the data. The complicated nature of the data obscures this intuition, and we would like to study the problem in a simplified setting with potential for broad insights. A useful tool, allowing the simplification of statistical problems, is Le Cam's notion of asymptotic equivalence [29]. It is often the case that a complex statistical model, over a given parameter space Θ, is asymptotically equivalent to a simpler model over Θ. In this case both models share identical rates under all bounded loss functions and one may reduce a study of the complicated model to a proof of asymptotic equivalence and a corresponding analysis of the simpler model. In particular, asymptotic equivalence allows one to address questions about complicated issues such as minimaxity in a simplified setting. The white noise model,

    Y(dt) = f(t) dt + m^{-1/2} B(dt),   t ∈ [0, 1],   (1.0.2)

with B a standard Brownian motion, has had a profound impact on the study of many problems in nonparametric function estimation through the notion of asymptotic equivalence. On the one hand, work as in Brown and Low [3], Brown et al. [4] and Reiß [37] has established that this model is asymptotically equivalent to the classical

nonparametric regression experiment

    y_j = f(x_j) + z_j,   z_j i.i.d. N(0, 1),   j = 1, ..., m,   (1.0.3)

for various collections of functions F and sampling designs for x_j ∈ [0, 1] which grow dense as m → ∞. On the other hand, given an orthonormal basis {ψ_k}_{k=1}^∞ of L^2[0, 1], one may decompose the white noise model into a collection of normal means. Basic properties of Ito integration give

    Y(ψ_k) = ∫_0^1 ψ_k(t) Y(dt) = θ_k + m^{-1/2} z_k,   z_k i.i.d. N(0, 1),   k = 1, 2, ...,   (1.0.4)

with θ_k = ⟨f, ψ_k⟩_{L^2}. Further, given estimates θ̂_k of the θ_k, one may form an estimate f̂ of f from f̂ = Σ_k θ̂_k ψ_k, and isometry yields

    E||f̂ − f||^2_{L^2} = Σ_{k=1}^∞ E(θ̂_k − θ_k)^2.   (1.0.5)

In this way, we see that estimation in (1.0.3) under the MISE metric may be translated via the white noise model into an infinite dimensional normal mean recovery problem under l^2 risk. Various collections of functions of practical interest, F, have natural characterizations in terms of geometric constraints on their Fourier coefficients θ_k in an appropriate basis, and reduction to (1.0.4) can considerably simplify the problem of characterizing the minimax risk

    R_m(F) = inf_{f̂} sup_{f ∈ F} E||f̂ − f||^2_{L^2}.   (1.0.6)

See Pinsker [34], Donoho [19] and Donoho et al. [21, 22]. Further, construction of

estimators which attain the minimax rate

    sup_{f ∈ F} E||f̂ − f||^2_{L^2} ≍ R_m(F)

over broad classes of functions, F, can be much simpler in the framework of (1.0.4), while providing indications on how to proceed in the general setting of (1.0.3). See Candes [11], Cai [6] and Johnstone [26]. We propose that similar insights may be won in the context of FDA by employing the white noise model there. Let B_i, i = 1, ..., n, be independent Brownian motions. In the body of the paper we will justify studying risk in the recovery problem (1.0.1) in the simplified context of the collection of white noise models

    Y_i(dt) = f_i(t) dt + m^{-1/2} B_i(dt),   for i = 1, ..., n, t ∈ [0, 1],   (1.0.7)

by proving asymptotic equivalence over a class of functions, F_{m,n}, growing in m and n, which almost surely contains the f_i. Then, fixing an orthonormal basis {ψ_k}_{k=1}^∞, the synthesis of (1.0.4) reduces the study of MISE in (1.0.7) to the study of the infinite dimensional random effects model

    Y_i(ψ_k) = ξ_ik + m^{-1/2} z_ik,   z_ik i.i.d. N(0, 1),   k ∈ N, i = 1, ..., n,   (1.0.8)

under l^2 error. Here assumptions guarantee that the ξ_ik = ∫_0^1 f_i ψ_k are realizations of mean 0 normal variables with covariance structure E ξ_il ξ_jm = δ_ij ⟨ψ_l, Cψ_m⟩, where δ_ij = 1(i = j). Let λ_k = Var(ξ_ik) = ⟨ψ_k, Cψ_k⟩. In the case of zero correlation the Wiener filter, which recovers the f_i from Fourier coefficient estimates

    ξ̂_ik = λ_k/(λ_k + m^{-1}) Y_i(ψ_k),   (1.0.9)

is optimal, with risk behaving like

    E||f̂_i − f_i||^2_{L^2} ≍ Σ_{k=1}^∞ min(λ_k, m^{-1}).   (1.0.10)

This is minimized when {ψ_k}_{k ∈ N} is taken to be the Karhunen-Loève basis. Nevertheless, as long as the λ_k decay in the same way as the Karhunen-Loève eigenvalues, this rate will remain unchanged, even in the presence of correlation among the ξ_ik. For this reason, we expect Wiener filtering to perform near optimally for a broad range of bases. Unfortunately this is an oracle estimation strategy and we will not, in general, know the variance structure needed to employ it. Classically, oracle estimators of similar form to (1.0.9) have been mimicked by information pooling. Fourier coefficients are grouped into blocks with coefficients of similar size and clever estimation strategies employed to mimic the oracle, see e.g. Cai [5], Zhang [47]. The setup of our problem provides an opportunity to expand on this in a novel way. We do this by using repeated observations, as opposed to frequency blocking, as a means of information pooling. In particular we block Fourier coefficients across individuals and perform Stein type thresholding. An oracle inequality for the conditional risks of estimating the ξ_ik by this method is derived and used to show that the E_{f_i}||f̂_i − f_i||^2_{L^2} attain the rate (1.0.10) with high probability. Here E(·) represents full expectation while E_{f_i}(·) = E(· | f_i) represents expectation conditioned on the realized trajectories. We then draw a connection to the classical theory by considering a smallest collection of restricted parameter spaces F_{m,n} containing the f_1, ..., f_n almost surely as m, n → ∞. First, we are able to show that

    max_{i ≤ n} E_{f_i}||f̂_i − f_i||^2_{L^2} = o_{a.s.}(R_m(F_{m,n}))

and so the f_i are, in a sense, simultaneously recovered in a superefficient manner. Further, for any f ∈ F_{m,n} we may consider the white noise experiment (1.0.2). In relation to this parameter space, we show that the estimator f̂ = Σ_{k=1}^∞ α_{mn,k} Y(ψ_k) ψ_k, where the α_{mn,k} are the threshold estimates derived from our strategy, is nearly asymptotically rate adaptive, with

    sup_{f ∈ F_{m,n}} E||f̂ − f||^2_{L^2} ≤ (1 + o(1)) (log(mn))^{(2r+1)/(2r+2)} R_m(F_{m,n})

when λ_k ≍ k^{-2(r+1)}. Thus, in a very real sense this is a near minimax strategy which may have applications beyond recovery of the f_i. Recently, similar strategies have been employed successfully in the context of FDA. Meister [32] introduced white noise approximations to study the Functional Linear Model (FLM). In his paper, Meister was able to show that the FLM is asymptotically equivalent to an inverse problem in white noise

    Y(dt) = [C^{1/2} θ](t) dt + n^{-1/2} B(dt),   t ∈ [0, 1],

and use this to provide a relatively simple characterization of minimax rates for estimation of θ. In Lei [30] this framework has been applied to the problem of testing H_0: θ = 0 vs. H_1: θ ≠ 0 for the FLM and we expect other applications to follow.
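To fix ideas, here is a minimal numerical sketch of the reduction just described: it simulates the random effects model (1.0.8) and applies the oracle Wiener filter (1.0.9). The eigenvalue decay, dimensions and sample sizes are illustrative assumptions made only for this example (not settings taken from the thesis), and the basis is treated as the Karhunen-Loève basis so that the coefficients are uncorrelated.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes and smoothness: n curves, noise level tau^2 = 1/m, K retained frequencies.
    n, m, K, r = 50, 256, 100, 1
    tau2 = 1.0 / m
    lam = np.arange(1, K + 1, dtype=float) ** (-2.0 * (r + 1))   # assumed lambda_k ~ k^{-2(r+1)}

    # True Fourier coefficients xi_{ik} ~ N(0, lambda_k), independent across k (KL basis case).
    xi = rng.normal(size=(n, K)) * np.sqrt(lam)

    # Observations in the sequence model (1.0.8): Y_i(psi_k) = xi_{ik} + m^{-1/2} z_{ik}.
    Y = xi + np.sqrt(tau2) * rng.normal(size=(n, K))

    # Oracle Wiener filter (1.0.9): shrink each coefficient by lambda_k / (lambda_k + tau^2).
    xi_hat = Y * (lam / (lam + tau2))

    # By the isometry (1.0.5), each curve's L^2 error equals the squared l^2 error of its coefficients.
    per_curve_err = np.sum((xi_hat - xi) ** 2, axis=1)
    oracle_benchmark = np.sum(lam * tau2 / (lam + tau2))   # the quantity behind (1.0.10)

    print(per_curve_err.mean(), oracle_benchmark)

Averaged over the n curves, the realized errors should track the oracle benchmark Σ_k λ_k τ^2/(λ_k + τ^2); the blocked Stein procedure of Chapter 3 aims to match this without knowing the λ_k.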

Chapter 2

White Noise Equivalence for Functional Data

2.1 Le Cam Equivalence

We begin by discussing the notion of Le Cam equivalence employed in Brown and Low [3] and Reiß [37], following an amalgam of notation from these sources. For thorough overviews, refer to Le Cam and Lo Yang [29], Le Cam [28]. In the abstract, we assume that we have two sequences of experiments

    E_{m,n} = {(X_1^m, B_1^m, P_{1,θ}^m), θ ∈ Θ_n}   and   G_{m,n} = {(X_2^m, B_2^m, P_{2,θ}^m), θ ∈ Θ_n}.

The experiments consist of standard probability triples, changing in m and indexed by an identical parameter space Θ_n, possibly changing with the index n. Although the parameter space is taken to be the same for the two experiments, the probability triples may be entirely different. We assume that we make decisions about θ ∈ Θ_n based on the outcome of the experiments. These decisions, δ_i : X_i^m → A, are assumed to take values in a common

action space A and we measure their quality with loss functions L = L_n : Θ_n × A → [0, ∞) through the corresponding risk

    R_i(δ_i, L, θ) = E_{P_{i,θ}^m} L(δ_i, θ).

(In the case of a randomized decision rule, δ = δ(x, ·) is a probability measure on the action space A and we take L(δ, θ) = ∫_A L(a, θ) δ(x, da).) With pseudo-norm ||L|| = ||L_n|| = sup{L_n(θ, a) : θ ∈ Θ_n, a ∈ A}, Le Cam's notion of equivalence, as employed in the references listed above, says these experiments are asymptotically equivalent if the distance

    Δ(E_{m,n}, G_{m,n}) := max[ sup_{δ_2} inf_{δ_1} sup_{θ ∈ Θ_n, ||L|| = 1} |R_1(δ_1, L, θ) − R_2(δ_2, L, θ)|,
                               sup_{δ_1} inf_{δ_2} sup_{θ ∈ Θ_n, ||L|| = 1} |R_1(δ_1, L, θ) − R_2(δ_2, L, θ)| ],   (2.1.1)

tends to 0 as m, n → ∞. One of the salient features of Le Cam equivalence is the implication that for any decision rule δ_1 in the first experiment, there is a corresponding rule δ_2 in the other, so that

    sup_{θ ∈ Θ_n, ||L|| = 1} |R_1(δ_1, L, θ) − R_2(δ_2, L, θ)| = o(1),

and vice versa. Since this holds for all θ ∈ Θ_n and bounded L, simultaneously, we see that we may learn about the risk of estimation in one experiment by studying it in the context of the other. The Le Cam distance is a challenging quantity to bound. The standard route in the current context is to construct a sufficient statistic which maps the sample space of one experiment to that of the other. As these preserve information, the corresponding experiment has 0 Le Cam distance from the original one and so the triangle inequality reduces the problem to comparing experiments on the same sample space. This is a simpler problem for which bounds exist in terms of well known probability metrics

such as total variation distance.

2.2 White Noise Equivalence for Nonparametric Data

In the context of nonparametric regression, Brown et al. [4], Brown and Low [3] and Reiß [37], among others, derive equivalence of nonparametric regression to the white noise model for various sampling designs. In particular, a function space Θ_n ⊂ L^2[0, 1] is assumed and, with the notation of the previous section, E_{m,n} is taken to consist of the probability spaces generated by the nonparametric experiment (1.0.3) as f ranges over Θ_n. Various designs x_i may be considered, but the basic requirement is that the design grow dense in [0, 1] as m is increased. In the cited references, the design is either taken to be random and uniform on [0, 1] or near equi-spaced in the sense that max_i |x_i − x_{i−1}| = o(1), at some regularity determined rate. These functions also generate the collection of probability spaces corresponding to the diffusions (1.0.2) and we will denote the corresponding experiment by G_{m,n}. In [37], Reiß establishes a bound on the Le Cam distance, Δ(E_{m,n}, G_{m,n}), of

    Δ(E_{m,n}, G_{m,n}) ≲ m^{1/2} sup_{f ∈ Θ_n} ||f − I_m f||_{L^2},

where I_m f is a projection of f into a design dependent interpolation space. He then proceeds to find bounds on this distance for a number of standard function classes used in nonparametric regression. Here we will be interested in bounds for growing Sobolev balls

    Θ_n = F_S(s, R_n) := {f ∈ H^s([0, 1]) : ||f||_{H^s} ≤ R_n}   (2.2.1)

with R_n → ∞, H^s([0, 1]) denoting the Sobolev space

    H^s([0, 1]) = {f : f, f^{(1)}, ..., f^{(s−1)} ∈ C([0, 1]) and f^{(s)} ∈ L^2([0, 1])},   (2.2.2)

    ||f||^2_{H^s} = ||f||^2_{L^2} + ||f^{(s)}||^2_{L^2},

and s ∈ N. For these spaces and equi-spaced design, Reiß found the bound

    Δ(E_{m,n}, G_{m,n}) ≲ m^{1/2−s} R_n.   (2.2.3)

For s > 1/2 and R_n → ∞ slowly enough, this allows us to establish Le Cam equivalence between E_{m,n} and G_{m,n} for the growing sequence of Sobolev balls, Θ_n.

2.3 Functional Data and White Noise Representations

We reiterate the data setup and elaborate on assumptions. Functions f_1, ..., f_n are assumed to be independent realizations of a mean 0 Gaussian distribution over H^s([0, 1]). Further, the covariance kernel C(s, t) = E f(s)f(t) is assumed to satisfy the Sacks-Ylvisaker conditions of order r ≥ s (see 7.0.1). In this context, our data take the form of noisy and intermittent observations on the f_i as in (1.0.1). We wish to use the white noise models (1.0.7) to study the risk of estimating the f_i from (1.0.1) in the situation where the design is equi-spaced, x_ij = j/m, j = 1, ..., m. In this direction, we have the following theorem.

Theorem 2.3.1. Assume, in addition to the conditions listed above, that the integral operator corresponding to C may be simultaneously diagonalized with the integral operator corresponding to the reproducing kernel of H^s([0, 1]). Suppose further that m, n → ∞ in such a way that m^{1/2−s}(log n)^{1/2} → 0. Then for any estimation strategy

δ_2 in (1.0.7), there is a corresponding estimation strategy δ_1 in (1.0.1) so that

    sup_{i ≤ n, ||L|| = 1} |R_1(δ_1, L, f_i) − R_2(δ_2, L, f_i)| = o_{a.s.}(1).

Consequently, we may study the risk of recovering the f_i from (1.0.1) in the context of (1.0.7). The proof of this theorem relies on the following lemma.

Lemma 2.3.2. Suppose K is the reproducing kernel of H^s([0, 1]) and C is the covariance kernel of a Gaussian process f living on [0, 1]. With abuse of notation, we also use K and C to denote the integral operators on L^2[0, 1] generated by these kernels. Suppose further that C satisfies the Sacks-Ylvisaker conditions of order r ≥ s and that C and K may be simultaneously diagonalized. Thus there are eigenvalues λ_k and an orthonormal basis of L^2[0, 1], φ_k, so that Q = K^{−1/2} C K^{−1/2} satisfies Q φ_k = λ_k φ_k. Then we have the representations

    f = Σ_{k=1}^∞ η_k (K^{1/2} φ_k)   and   ||f||^2_{H^s} = Σ_{k=1}^∞ η_k^2.   (2.3.1)

Further, the η_k are independent, satisfy η_k ~ N(0, λ_k), and the eigenvalues decay like λ_k ≍ k^{−2(r+1)+2s}.

Proof. We follow the arguments of Yuan and Cai [46] to determine the decay of the eigenvalues of Q = K^{−1/2} C K^{−1/2}. First, we let K_p denote the reproducing kernel of H^p([0, 1]), for arbitrary p, and λ_k(O) the k-th eigenvalue of an operator O. Then the proof of Theorem 5 from [46] shows that for q > p, K_p^{−1/2} K_q K_p^{−1/2} is equivalent to K_{q−p} and λ_k(K_p^{−1/2} K_q K_p^{−1/2}) ≍ k^{−2(q−p)}. Further, the proof shows that if C satisfies the Sacks-Ylvisaker conditions of order r ≥ p, then λ_k(K_p^{−1/2} C K_p^{−1/2}) ≍ λ_k(K_p^{−1/2} K_{r+1} K_p^{−1/2}) ≍ k^{−2(r+1)+2p}, and this establishes the decay quoted in the theorem.

The representations f = Σ_{k=1}^∞ η_k (K^{1/2} φ_k) and ||f||^2_{H^s} = Σ_{k=1}^∞ η_k^2, the independence of the η_k and the distributions η_k ~ N(0, λ_k) follow immediately from the assumptions and the results of Kadota [27] on representations of Gaussian processes.

Proof of Theorem 2.3.1. We employ ideas and concepts from Kadota [27], Ritter [38] and Yuan and Cai [46] to find a constant R_n so that the norms ||f_i||_{H^s} exceed R_n only finitely often as n → ∞. With this R_n and Θ_n as in (2.2.1) we are guaranteed that f_1, ..., f_n ∈ Θ_n as n → ∞ and so the bound (2.2.3) holds over a parameter space containing f_1, ..., f_n. Let K denote the reproducing kernel for H^s([0, 1]) and assume that C and K may be simultaneously diagonalized. Thus, letting Q = K^{−1/2} C K^{−1/2}, the assumption is that Q is defined on a dense subset of L^2[0, 1] and there are eigenvalues {λ_i}_{i ∈ N} and eigenfunctions {φ_i}_{i ∈ N}, which form a basis for L^2[0, 1], so that Q φ_i = λ_i φ_i for all i ∈ N. As in [27], we may write f ~ N(0, C) as f = Σ_{k=1}^∞ η_k K^{1/2} φ_k, and in this representation the η_k are independent with η_k ~ N(0, λ_k). Further, the simultaneous diagonalization gives

    ||f||^2_{H^s([0,1])} = ⟨f, K^{−1} f⟩ = Σ_{k=1}^∞ η_k^2.

Arguments from [38] and [46] (in particular Theorem 5 of [46]) adapt to show that λ_k ≍ k^{−2(r+1)+2s} = O(k^{−2}) under the assumption that r ≥ s. Denote by Tr(Q) the sum Σ_{k=1}^∞ λ_k. Since λ_k ≍ k^{−2(r+1)+2s} = O(k^{−2}), the sum Tr(Q) = Σ_{k=1}^∞ λ_k < ∞ and we have that the ratio γ(Q) = Tr(Q^2)/(Tr(Q))^2 ≤ B is bounded. The concentration lemma of the Appendix then gives that

    P( | ||f||^2_{H^s([0,1])} − Tr(Q) | > α Tr(Q) ) ≤ exp[ −c min( α^2/B, α/B^{1/2} ) ].

A union bound then implies that for f_1, ..., f_n ~ N(0, C),

    p_n(α) := P( max_{i ≤ n} | ||f_i||^2_{H^s([0,1])} − Tr(Q) | > α Tr(Q) ) ≤ n exp[ −c min( α^2/B, α/B^{1/2} ) ].

Choosing α_n = 3B^{1/2} log n / c gives p_n(α_n) ≤ n^{−2}, which is summable. Borel-Cantelli then implies that the event

    max_{i ≤ n} | ||f_i||^2_{H^s([0,1])} − Tr(Q) | > α_n Tr(Q)

happens only finitely often. In consequence, if for some C > 1 we take R_n = (C(1 + α_n) Tr(Q))^{1/2}, then as long as n, m diverge in such a way that m^{1/2−s}(log n)^{1/2} tends to 0, we eventually have f_1, ..., f_n ∈ Θ_n as n → ∞ and m^{1/2−s} R_n → 0 (since Tr(Q) < ∞ while α_n ≍ log n). Hence for any δ_1 in the regression experiment, there is a δ_2 in the white noise experiment, and vice versa, so that

    sup_{i ≤ n, ||L|| = 1} |R_1(δ_1, L, f_i) − R_2(δ_2, L, f_i)| = o_{a.s.}(1).

Before moving on, some comments and discussion are in order. The decision rules δ_i are general (possibly randomized) rules for data of the form (1.0.2), (1.0.3), with the risks R_i(δ_i, L, f_i) conditioned on the f_i. Our interest is in MISE, but the risks covered by the theorem are arbitrary, modulo the restriction that the loss functions defining them be bounded uniformly over the parameter spaces considered. Although the theorem and corresponding proof employ some technical and unintuitive notions, these are in place for the sole purpose of allowing a simpler underlying theme to take form: combined with Gaussianity of the underlying process, they guarantee that the norms ||f_i||_{H^s} are well

behaved random variables and their maximum may be easily bounded by a slowly diverging value, R_n, which is exceeded only finitely often. Under the assumptions of the theorem, this makes it possible to choose an R_n so that m^{1/2−s} R_n tends to 0, which in turn guarantees statistical equivalence of the experiments E_{m,n} and G_{m,n} over the function classes Θ_n = F_S(s, R_n) from above. Now Θ_n eventually contains all of the f_i, and once this has happened, for any δ_2 we are guaranteed a δ_1 so that

    sup_{i ≤ n, ||L|| = 1} |R_1(δ_1, L, f_i) − R_2(δ_2, L, f_i)| ≤ Δ(E_{m,n}, G_{m,n}) ≲ m^{1/2−s} R_n,   (2.3.2)

and vice versa. Because this is tending to 0, the risks of recovering the f_i from (1.0.1) may be modelled by the risks of recovering the f_i from (1.0.7). In the context of non equi-spaced design, there is still something to say when s is large enough. To this end, we consider a third experiment F_{m,n}, generated by y_i = f(i/m) + z_i for i = 1, ..., m as f ranges over Θ_n. Thus F_{m,n} is equi-spaced and, by the result just stated, the Le Cam distance satisfies Δ(F_{m,n}, G_{m,n}) ≲ m^{1/2−s} R_n. In the particular case of a pair of multivariate normal experiments with identical variance structure, Theorem 3.1 in Brown and Low [3] and the results that follow yield the bound

    Δ(E_{m,n}, F_{m,n}) ≤ sup_{f ∈ Θ_n} H( N((f(x_i))_{i ≤ m}, I_m), N((f(i/m))_{i ≤ m}, I_m) ),

where we denote by (a_i)_{i ≤ m} the vector (a_1, ..., a_m)^T and H(·, ·) is the Hellinger distance.

In the case s > 1, it follows from standard results that

    H^2( N((f(x_i))_{i ≤ m}, I_m), N((f(i/m))_{i ≤ m}, I_m) ) ≲ Σ_{i=1}^m (f(x_i) − f(i/m))^2 ≤ ||f'||_∞^2 Σ_{i=1}^m |x_i − i/m|^2.

Hence if the design in conjunction with Θ_n satisfies sup_{f ∈ Θ_n} ||f'||_∞ (Σ_i |x_i − i/m|^2)^{1/2} → 0, then the triangle inequality gives Δ(E_{m,n}, G_{m,n}) → 0 and equivalence holds for non-equispaced designs satisfying these conditions. The design condition is satisfied if sup_{i ≤ m} |x_i − i/m| = O(m^{−(ɛ+1/2)}), in which case (Σ_i |x_i − i/m|^2)^{1/2} = O(m^{−ɛ}), while for s > 1, ||f'||_∞ may be bounded by sup_{f ∈ Θ_n} ||f||_{H^s} ≲ R_n. Thus, for the specified design condition, if the latter quantity is o(m^ɛ) as m, n → ∞, as for R_n = O(log n) when n is bounded by a polynomial power of m, the equivalence will hold.
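As a quick numerical illustration of this design-perturbation bound, the following sketch compares the squared Hellinger distance between the two Gaussian shift experiments with the quantity Σ_i (f(x_i) − f(i/m))^2 that bounds it. The jittered design, the single test function and the convention H^2 = 2(1 − Bhattacharyya coefficient) are all illustrative choices for the example, not anything specified in the thesis.

    import numpy as np

    rng = np.random.default_rng(1)

    # Equispaced points jittered at the m^{-(eps + 1/2)} scale, as in the design condition above.
    m, eps = 200, 0.25
    grid = np.arange(1, m + 1) / m
    x = grid + rng.uniform(-0.5, 0.5, size=m) * m ** (-(eps + 0.5))

    f = lambda t: np.sin(2 * np.pi * t)      # one smooth test function (illustrative)

    mu1, mu2 = f(x), f(grid)
    d2 = np.sum((mu1 - mu2) ** 2)            # the sum bounding H^2 in the display above

    # Squared Hellinger distance between N(mu1, I_m) and N(mu2, I_m),
    # using H^2 = 2 * (1 - exp(-||mu1 - mu2||^2 / 8)).
    H2 = 2.0 * (1.0 - np.exp(-d2 / 8.0))

    print(H2, d2)    # both are small when the design perturbation is small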

Chapter 3

Stein Estimation and Oracle Inequalities

Having justified studying (1.0.7) in place of (1.0.1), we move forward in this direction. The basic procedure is to fix a basis {ψ_k}_{k=1}^∞ of L^2[0, 1] and block the corresponding noisy Fourier coefficients obtained from (1.0.7) across individuals, frequency by frequency, to form the vectors

    Y_k = (Y_1(ψ_k), ..., Y_n(ψ_k))^T,   for k ∈ N.   (3.0.1)

We then estimate ξ_k = (ξ_1k, ..., ξ_nk)^T, for k = 1, ..., m, by Stein-type estimation

    ξ̂_k = α_{mn,k} Y_k,   where   α_{mn,k} = ( 1 − c_{n,m} (n/m) / ||Y_k||^2 )_+,   (3.0.2)

and construct corresponding estimates of the f_i by

    f̂_i = Σ_{k=1}^m ξ̂_ik ψ_k.   (3.0.3)

Here the constants c_{n,m} = 1 + o(1) as m, n → ∞ are chosen to improve the estimator by, with high probability, forcing α_{mn,k} = 0 for most frequencies k with

small signal to noise ratio (SNR). Properties of these estimators are explored by an oracle inequality approach. Standard oracle inequalities for Stein estimation in this context, as in e.g. Candes [11], Johnstone [26] and Tsybakov [43], bound the total risks E_{ξ_k}||ξ̂_k − ξ_k||^2. Although this works well in the classical setting, where total risk contributes to MISE, this is not the case in the current application. This is because we need the conditional MSEs of estimating ξ_ik by ξ̂_ik to calculate E_{f_i}||f̂_i − f_i||^2_{L^2} by (1.0.5), and it is unclear how to extrapolate these from classical oracle inequalities. One of the novel contributions of this thesis is to derive new oracle inequalities for Stein estimation in this setting. To this end, we employ conditional concentration of measure to derive oracle inequalities which hold simultaneously and with high probability. This is presented first in the simplified setting of bounding MSE for the components of a single ξ̂_k. Results are then lifted to bound the E_{f_i}||f̂_i − f_i||^2_{L^2}, and requirements on {ψ_k}_{k=1}^∞ in relation to the process generating the f_i are discussed. Another novel point is that we are able to recover the f_i optimally without directly estimating the covariance structure, C, as is standard practice in FDA approaches. Moving forward, we simplify notation, replacing m^{−1/2} by τ, and study the problem of estimating the ξ_k from (3.0.1) under the distributional assumptions of the introduction. Before presenting risk properties for this estimation strategy, we provide some motivation for the estimator. First consider the case where c_{n,m} = 1. The motivation behind this estimator is that for large n the distribution of ||Y_k||^2 concentrates heavily at n(λ_k + τ^2), in which case

    ξ̂_k ≈ ξ̂_k^o = λ_k/(λ_k + τ^2) Y_k.

This is the linear estimator formed from Y_k that an oracle seeking to minimize E(ξ̂_ik − ξ_ik)^2, i = 1, ..., n, would use. On average we expect this estimator to perform near optimally in terms of componentwise MSE. In fact, if we were to use this oracle

strategy, ξ̂_k^o, to estimate ξ_k, we would find conditional MSEs, R_{i,k}^o = E_{f_i}(ξ̂_ik^o − ξ_ik)^2, of

    R_{i,k}^o = λ_k τ^2/(λ_k + τ^2) + τ^4/(λ_k + τ^2)^2 (ξ_ik^2 − λ_k),   for i = 1, ..., n.   (3.0.4)

Using these oracle weights in conjunction with concentration of measure for quadratic forms we find, given mild assumptions on the covariances ⟨Cψ_j, ψ_k⟩, that this oracle would incur a risk of estimation

    R_i^o := E_{f_i}||f̂_i^o − f_i||^2_{L^2} = (1 + o_{a.s.}(1)) Σ_{k=1}^∞ λ_k τ^2/(λ_k + τ^2),   (3.0.5)

as τ → 0. If we naively take c_{n,m} = 1, this estimation strategy runs into problems. On a set of high probability we have (1 − δ)(λ_k + τ^2) ≤ ||Y_k||^2/n ≤ (1 + δ)(λ_k + τ^2) for k = 1, ..., m when δ ≍ (log m log n / n)^{1/2}, and taking c_{n,m} = 1, the best we can say is that on this event α_{mn,k} comes within an additive factor of δ of the optimal value λ_k/(λ_k + τ^2). This approach yields sub-optimal risk bounds in the low signal to noise ratio (SNR) regime. Nevertheless, when λ_k << τ^2, λ_k/(λ_k + τ^2) may be significantly closer to 0 than δ, and in this case we might do better by choosing c_{n,m} to guarantee that α_{mn,k} = 0 with high probability. In fact, by taking c_{n,m} = 1 + 2δ = 1 + o(1) we find that with high probability we are estimating the ξ_ik as 0 whenever λ_k ≤ δτ^2/(1 + δ). Given the model assumptions this condition defines a threshold β_{mn} given by

    β_{mn} ≍ ( n / (log m log n) )^{1/4(r+1)} m^{1/2(r+1)},

which is a slight inflation of the truncation point that a projection oracle would choose! For frequencies k ≳ β_{mn} we have a guarantee that with high probability we are estimating the ξ_ik as 0, and this allows us to significantly improve the risk properties of our estimator over those for c_{n,m} = 1.
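To make the estimator concrete, here is a minimal sketch of the blocked Stein procedure (3.0.2)-(3.0.3) applied to synthetic coefficients. The eigenvalue decay, the sizes and the choice δ = (12 log m / n)^{1/2} (the practical choice discussed after Theorem 4.0.3) are illustrative assumptions for this example only.

    import numpy as np

    def blocked_stein(Y, m, delta):
        # Y: (n, K) array of noisy Fourier coefficients, Y[i, k] = Y_i(psi_k).
        # m: samples per curve, so the sequence-model noise level is tau^2 = 1/m.
        # delta: concentration tolerance; the inflation factor is c_{n,m} = 1 + 2*delta.
        n, K = Y.shape
        c_nm = 1.0 + 2.0 * delta
        norms2 = np.sum(Y ** 2, axis=0)                              # ||Y_k||^2, one per frequency
        alpha = np.clip(1.0 - c_nm * (n / m) / norms2, 0.0, None)    # positive-part weights (3.0.2)
        return Y * alpha, alpha                                      # xi_hat[i, k] = alpha_k * Y[i, k]

    # Illustrative use on synthetic coefficients.
    rng = np.random.default_rng(2)
    n, m, K, r = 500, 256, 100, 1
    tau2 = 1.0 / m
    lam = np.arange(1, K + 1, dtype=float) ** (-2.0 * (r + 1))
    xi = rng.normal(size=(n, K)) * np.sqrt(lam)
    Y = xi + np.sqrt(tau2) * rng.normal(size=(n, K))

    delta = np.sqrt(12.0 * np.log(m) / n)        # lies in (0, 1/2) for these sizes
    xi_hat, alpha = blocked_stein(Y, m, delta)

    print(np.mean(np.sum((xi_hat - xi) ** 2, axis=1)),   # average blocked Stein error per curve
          np.sum(lam * tau2 / (lam + tau2)))             # oracle benchmark (3.0.5)

Frequencies with λ_k well below δτ^2 receive weight exactly 0, which is the thresholding behaviour described above.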

In relation to this we have the following theorem connecting the risk of Stein estimation to the conditional oracle risks.

Theorem 3.0.1. For ξ_ik = ⟨f_i, ψ_k⟩, λ_k = Var(⟨f_1, ψ_k⟩) and ξ_k as above, we have ξ_k ~ N_n(0, λ_k I_n) and Y_k ~ N_n(ξ_k, τ^2 I_n). For δ ∈ (0, 1/2) set A_δ = ∪_{k=1}^m A_{k,δ}, with A_{k,δ} defined by complement through the relation

    A_{k,δ}^c = { (1 − δ) n (λ_k + τ^2) ≤ ||Y_k||^2 ≤ (1 + δ) n (λ_k + τ^2) }.

Now take c_{n,m} = 1 + 2δ and let ξ̂_k denote the Stein estimates from (3.0.2). Then for all i = 1, ..., n, k = 1, ..., m it holds that

    E_{f_i}(ξ̂_ik − ξ_ik)^2 ≤ R_{ik}^o + e_ik,   (3.0.6)

where

    e_ik = max(1, ξ_ik^2/λ_k) [ C_δ min(λ_k, δτ^2) + C_τ P_i^{1/2}(A_δ) (λ_k + τ^2) ].   (3.0.7)

Here C_δ ≤ 3(6 + δ)(1 + 3δ/(1 − δ))/(1 − δ) and C_τ are both bounded constants, and P_i(·) = P(· | f_i) is the probability measure conditioned on f_i. Further, the probabilities in (3.0.7) satisfy the bounds

    P_i(A_δ) ≤ 3 exp( δ max_{k ≤ m} ξ_ik^2 / 2λ_k ) exp( −nδ^2/6 + log m ).   (3.0.8)

Before the proof, some comments are in order. Although this result has a similar feel to standard oracle inequalities for Stein estimation, the distributional assumptions of the problem at hand allow us to make quantitative assertions about the recovery of the individual effects, and this seems both genuinely new and applicable to any scenario where similar distributional assumptions may be in play. Further, although

Gaussianity simplifies the statement of the problem and corresponding calculations, the key property resulting from this assumption is sub-Gaussianity. This implies a Hanson-Wright style inequality for concentration of quadratic forms of the Fourier coefficients, which is used in the proof of the theorem and the derivation of subsequent results. The other key point is that it is possible to make the theory deal with harsher error distributions; for instance, with finite second moments one can construct sub-Gaussian estimators of the mean, and this allows us to carry the theory over to much harsher error distributions with small amendments to the estimator. Sub-Gaussianity (or something near this) will always be required of the underlying functional distribution in lifting the results to bound the error of L^2 recovery, but we do not feel that this is overly restrictive. Requiring decay of the form P(|L(f)| > t) ≤ K exp(−Ct^b), b > 0, for all linear functionals of the underlying process ensures that we are taking samples from a function class which is effectively specified by the variances, λ_k, of the Fourier coefficients. In this case the rescaled maximal Fourier coefficients for any given sample path grow like (log k)^{1/b}, keeping the size of the k-th Fourier coefficient to within logarithmic factors of λ_k^{1/2}, and so these are seen to tell us about the regularity of the sample paths. If instead we assumed decay of the form P(|L(f)| > t) ≤ K t^{−b}, b > 0, then the corresponding rescaled Fourier coefficients would be growing like k^{1/b} and we are effectively sampling from another regularity class. We might have achieved this with Fourier coefficients having variances decaying like k^{2/b} λ_k and tighter concentration for functionals of the process.

Proof of Theorem 3.0.1. We first establish the inequality in the case where λ_k > δτ^2/(1 + δ). We may write the Stein estimator of ξ_ik as

    ξ̂_ik = α_{n,k} y_ik = λ_k/(λ_k + τ^2) y_ik + ( α_{n,k} − λ_k/(λ_k + τ^2) ) y_ik.

Using that y_ik = ξ_ik + τ z_ik with z_ik ~ N(0, 1) allows us to write λ_k y_ik/(λ_k + τ^2) − ξ_ik =

30 Chapter 3. Stein Estimation and Oracle Inequalities 22 (λ k τz ik τ 2 ξ ik )/(λ k + τ 2 ) and we find ( E fi (ˆξ ik ξ ik ) 2 = R o i,k + E fi α n,k λ ) 2 k y 2 λ k + τ 2 ik }{{} I ( ) ( λk τz ik τ 2 ξ ik +2 E fi α λ k + τ 2 n,k λ ) k y λ k + τ 2 ik. }{{} II We proceed by bounding the terms I and II. Now δ (0, 1/2) and on the event A c δ the norm Y k satisfies the bounds (1 δ)n(λ k + τ 2 ) Y k 2 (1 + δ)n(λ k + τ 2 ), which gives that on A c δ α n,k λ k λ k + τ 2 3δ 1 δ τ 2 λ k + τ = C δτ 2 2 δ λ k + τ, 2 where we have fixed C δ = 3/(1 δ). Since both α n,k and λ k /(λ k + τ 2 ) lie in the interval (0, 1), this quantity is always bounded by 2. Using that E fi y 2 ik = ξ2 ik + τ 2, τ 2 /(λ k + τ 2 ) 1 gives ( E fi α n,k λ ) 2 k y 2 λ k + τ 2 ( ) ξ ik1 A c δ Cδ 2 δ 2 τ 2 2 ik + τ 2 λ k + τ 2 C 2 δ δ 2 τ 2 max ( 1, ξ2 ik λ k ). Further, we have y 4 ik 8(ξ4 ik + τ 4 z 4 ik ) which gives (E f i y 4 i ) 1/2 (8(ξ 4 i + 3τ 4 )) 1/2 24(ξ 2 ik +τ 2 ) while writing ξ 2 ik +τ 2 = (ξ 2 ik /λ k)λ k +τ 2 gives that ξ 2 ik +τ 2 max(1, ξ 2 ik /λ k)(λ k + τ 2 ). In the range under consideration, δτ 2 /(1 + δ) = min(λ k, δτ 2 /(1 + δ)) implies δτ 2 (1 + δ) min(λ k, δτ 2 ) and so an application of lemma yields ( ) ( I max 1, ξ2 ik Cδ 2 δ(1 + δ) min(λ k, δτ 2 ) + ) 24P 1/2 i (A δ )(λ k + τ 2 ). λ k It remains to bound the final term in the expression for E fi (ˆξ ik ξ ik ) 2. We

31 Chapter 3. Stein Estimation and Oracle Inequalities 23 begin by writing (λ k τz ik τ 2 ξ ik )y ik = (λ k τz ik τ 2 ξ ik )(ξ ik + τz ik ) and expand to get (λ k τz ik τ 2 ξ ik )y ik = λ k τ 2 z 2 ik τ 2 ξ 2 ik + (λ kτ τ 3 )z ik ξ ik. The final term to bound may be now be written ( II = E fi α n,k λ k λ k + τ 2 ) ( λk τ 2 z 2 ik τ 2 ξ 2 ik + (λ kτ τ 3 )z ik ξ ik λ k + τ 2 We pass the expectation through and bound this quantity term by term. For the first term, using that E fi z 2 i 1 Aδ E fi z 2 i = 1, we get a bound of ( E fi α n,k λ ) k λk τ 2 zik 2 λ k + τ 2 λ k + τ 1 δλ k τ 4 2 A δ C δ (λ k + τ 2 ) C δλ k τ 2 2 δ λ k + τ. 2 Similarly, for the second term we find that τ ( 2 ξik 2 λ k + τ E 2 f i α n,k λ ) k δξik 2 1 λ k + τ 2 Aδ C τ 4 δ (λ k + τ 2 ) C 2 δ Finally, we may write E fi ( α n,k λ ) ( k z λ k + τ 2 ik 1 Aδ = E fi α n,k ( ξ 2 ik λ k ). ) δλk τ 2 λ k + τ 2. λ ) k z λ k + τ 2 ik (1 zik <0 + 1 zik 0)1 Aδ and use that E fi z ik 1 zik 01 Aδ and E fi z ik 1 zik <01 Aδ are both bounded by E fi z ik 1 zik 0 = (2π) 1/2 while 2(2π) 1/2 1 to arrive at a bound of ( E fi α n,k λ ) k δτ 2 z λ k + τ 2 ik 1 Aδ C δ λ k + τ. 2 Arguing in the same way, we find a lower bound of C δ δτ 2 /(λ k + τ 2 ). Now if a, b are arbitrary numbers with b B and c and d are positive numbers, then a b (c d) = a b (max(c, d) min(c, d)) a B max(c, d). Using this, we find that ( (λ k τ τ 3 )ξ ik E λ k + τ 2 fi α n,k λ ) k δ ξ ik τ 3 max(λ k, τ 2 ) z λ k + τ 2 ik 1 Aδ C δ. (λ k + τ 2 ) 2

32 Chapter 3. Stein Estimation and Oracle Inequalities 24 Now observe that for any α (0, 2), since ab (a 2 + b 2 )/2, we have that 2δ 1 α/2 τ δα/2 ξ ik τ 2 λ k + τ 2 δ 2 α τ 2 + δ α ξ 2 ik τ 4 ( ξ (λ k + τ 2 ) 2 δ2 α τ 2 + δ α 2 ik λ k ) λk τ 2 λ k + τ 2. We observe that (λ k τz ik τ 2 ξ ik ) 2 2(λ 2 k τ 2 z 2 ik + τ 4 ξ 2 ik ) and that y2 ik 2(τ 2 z 2 ik + ξ2 ik ). Then expanding (λ k τz ik τ 2 ξ ik ) 2 y 2 ik and noting that E f i z 4 ik = 3 we arrive at the bound E fi (λ k τz ik τ 2 ξ ik ) 2 y 2 ik 12(λ2 k τ 2 +τ 4 ξ 2 ik )(τ 2 +ξ 2 ik ) 12λ kτ 2 (max(1, ξ 2 ik /λ k)(λ k +τ 2 )) 2, which gives the bound ( ) λk τz ii τ 2 2 ξ ik E fi y 2 λ k + τ 2 ik 12(max(1, ξik/λ 2 k )(λ k + τ 2 )) 2. An application of lemma gives ( ) [ 2II max 1, ξ2 ik C δ (4δ + δ α ) λ kτ 2 λ k λ k + τ 2 +C δ δ 2 α τ P 1/2 i (A δ )(λ k + τ 2 ) ] For δ (0, 1/2), λ k δτ 2 /(λ k + τ 2 ) λ k δτ 2 /(λ k + δτ 2 ) min(λ k, δτ 2 ), while in the range under consideration, δτ 2 (1 + δ) min(λ k, δτ 2 ). Taking α = 1, this reduces to ( ) [ 2II max 1, ξ2 ik 6C δ (1 + δ) min(λ k, δτ 2 ) + 4 ] 12P 1/2 i (A δ )(λ k + τ 2 ). λ k Combining bounds for terms I and II gives the bound ( ) [ E fi (ˆξ ik ξ ik ) 2 R o i,k + max 1, ξ2 ik C δ min(λ k, δτ 2 ) λ k +C τ P 1/2 i (A δ )(λ k + τ 2 ) ],

33 Chapter 3. Stein Estimation and Oracle Inequalities 25 where C δ = (6 + δ + δ(1 + δ)c δ) C δ and C τ = ( ) and this provides the bound of the theorem in the case that λ k > δτ 2. In the case that λ k δτ 2 /(1 + δ), min(λ k, δτ 2 /(1 + δ)) = λ k and we have that α n,k = 0 on the event A δ which gives that E fi (ˆξ ik ξ ik ) 2 1 Aδ ξik 2. We also have E fi (ˆξ ik ξ ik ) 4 4E fi (ξ 2 ik +τ2 z 2 ik )2 24(ξ 4 ik +τ4 ) and so an application of lemma gives E fi (ˆξ ik ξ ik ) 2 ξik P 1/2 i (A δ )(ξik 2 + τ 2 ) ( ) [ max 1, ξ2 ik min(λ k, δτ 2 ) + ] 24P 1/2 i (A δ )(λ k + τ 2 ) λ k and this implies the bound of the theorem in the second range. Finally, noticing that lemma gives P i (A k,δ ) 3 exp(δξ 2 i,k /2λ k) exp( nδ 2 /6) while δξ 2 i,k /2λ k max k m ξ 2 i,k /2λ k, a union bound concludes the proof.
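Before moving on to Chapter 4, a small Monte Carlo sketch of the concentration event A_δ underlying the bound (3.0.8) may be helpful: it estimates how often ||Y_k||^2 leaves the band (1 ± δ) n (λ_k + τ^2) for some k ≤ m. The settings are illustrative only and are not those used in the thesis.

    import numpy as np

    rng = np.random.default_rng(3)
    n, m, K, r, reps = 500, 256, 100, 1, 200
    tau2 = 1.0 / m
    lam = np.arange(1, K + 1, dtype=float) ** (-2.0 * (r + 1))
    delta = np.sqrt(12.0 * np.log(m) / n)

    hits = 0
    for _ in range(reps):
        xi = rng.normal(size=(n, K)) * np.sqrt(lam)
        Y = xi + np.sqrt(tau2) * rng.normal(size=(n, K))
        norms2 = np.sum(Y ** 2, axis=0)
        lo = (1.0 - delta) * n * (lam + tau2)
        hi = (1.0 + delta) * n * (lam + tau2)
        hits += np.any((norms2 < lo) | (norms2 > hi))   # did A_delta occur on this draw?

    print(hits / reps)   # empirical frequency of A_delta; it should be very small for these sizes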

Chapter 4

Minimax Recovery of Functional Data

We now lift the results of Theorem 3.0.1 to our estimation problem and draw a connection to the classical theory. This is done by considering the smallest collection of parameter spaces F_{m,n} which contains the f_1, ..., f_n almost surely as m, n → ∞. For any f ∈ F_{m,n} we may consider the white noise experiment

    Y(dt) = f(t) dt + m^{−1/2} B(dt).

In relation to this parameter space, we show that the estimator is nearly minimax rate adaptive over F_{m,n} and performs below the minimax rate for any function drawn from a similar distribution to that generating the f_i. This gives our procedure the flavour of a minimax robust strategy which may have applications beyond recovery of the f_i. The procedure is to estimate ξ_k by (3.0.2) for k ≤ m and by 0 for k > m. Then with

f̂_i as in (3.0.3), isometry gives

    E_{f_i}||f̂_i − f_i||^2_{L^2} = Σ_{k=1}^m E_{f_i}(ξ̂_{i,k} − ξ_{i,k})^2 + Σ_{k>m} ξ_{i,k}^2,

which we may bound using the oracle inequality from Theorem 3.0.1 and concentration of measure. Theorem 3.0.1 bounds the risks E_{f_i}(ξ̂_{i,k} − ξ_{i,k})^2 by R_{i,k}^o + e_{i,k} for i = 1, ..., n and k = 1, ..., m. With R_i = Σ_{k=1}^m R_{i,k}^o and e_i = Σ_{k=1}^m e_{i,k}, applying the theorem gives that

    E_{f_i}||f̂_i − f_i||^2_{L^2} ≤ R_i + e_i + Σ_{k>m} ξ_{i,k}^2,   i = 1, ..., n.   (4.0.1)

We shall examine and bound the terms on the right hand side separately, but first outline and discuss the assumptions required for this task. For the bound on the R_i we begin by noticing that these terms may be split into random and deterministic components as

    R_i = Σ_{k=1}^m λ_k τ^2/(λ_k + τ^2) + ζ_{i,m},   where   ζ_{i,m} = Σ_{k=1}^m τ^4/(λ_k + τ^2)^2 (ξ_{ik}^2 − λ_k).

Although the terms τ^4(ξ_{ik}^2 − λ_k)/(λ_k + τ^2)^2 may force R_{i,k}^o to be quite different from the optimal risk λ_k τ^2/(λ_k + τ^2) from one value of k to another, cancellations ensure that the ζ_{i,m} concentrate heavily at 0 under relatively mild correlation assumptions. This allows us to place this first part near the optimal risk with high probability. In fact, noting that C = E f_1 ⊗ f_1 gives Cov(ξ_{i,j}, ξ_{i,k}) = ⟨ψ_j, Cψ_k⟩, and so the variables x_{i,k} = τ^2 ξ_{i,k}/(λ_k + τ^2) are mean 0 Gaussian with covariance Q_ξ given termwise by

    Q_{ξ,jk} = ⟨ψ_j, Cψ_k⟩ τ^4 / ( (λ_j + τ^2)(λ_k + τ^2) ).   (4.0.2)

Following the notation of Lemma 7.0.4, we have ζ_{i,m} = ||x_i||^2 − Tr(Q_ξ), and so with

γ(Q_ξ) = Tr(Q_ξ^2)/(Tr(Q_ξ))^2 we may apply the concentration result of the lemma. Strong concentration relies on the ratio γ(Q_ξ) being small which, in turn, requires assumptions on the correlation between Fourier coefficients in the basis used.

Assumption 4.0.1. We make the following assumptions regarding the relation of {ψ_k}_{k=1}^∞ to C.

i.) The variances of the Fourier coefficients decay at the Karhunen-Loève rate, so that, given the Sacks-Ylvisaker assumption of order r, λ_k = ⟨ψ_k, Cψ_k⟩ ≍ k^{−2(r+1)}.

ii.) The correlations between the Fourier coefficients decay at a reasonable rate with distance between indices, in that

    |⟨ψ_i, Cψ_j⟩| ≲ (ij)^{−(r+1)} / (1 + |i − j|)^{β/2},   (4.0.3)

for some β > 1.

We make a couple of points regarding these assumptions. For the first assumption, covariance functions which satisfy the Sacks-Ylvisaker conditions of order r generate RKHSs which lie within a polynomial translation of the Sobolev space H^{r+1}([0, 1]) and the sample paths of the process just about lie in these (they lie in H^ν([0, 1]) for ν < r + 1/2 and fill this out as we repeatedly sample). There are many comparable smoothness classes which share similar decay of Fourier coefficients when expressed in bases which efficiently represent them and it is reasonable to expect that in such bases the variances of the Fourier coefficients decay at the Karhunen-Loève rate. For the second assumption, Cauchy-Schwarz bounds the covariances between Fourier coefficients by |⟨ψ_i, Cψ_j⟩| ≤ (λ_i λ_j)^{1/2} ≍ (ij)^{−(r+1)}. This assumption is telling us that when all variances are blown up to the same scale, the correlations between coefficients are bounded by a stationary covariance structure with weak decay. This

is also quite reasonable, especially in the case where {ψ_k}_{k=1}^∞ are taken to be wavelets. At the right scale, such off-diagonal decay conditions hold for broad classes of differential operators in wavelet bases [14], in which case the same off-diagonal decay also holds for the inverses of these operators, and these naturally correspond to covariance operators [38]. With this assumption, we are able to prove the following bounds for γ(Q_ξ):

Lemma 4.0.2. Suppose that C satisfies the Sacks-Ylvisaker conditions of order r and, in {ψ_k}_{k=1}^∞, λ_k = ⟨ψ_k, Cψ_k⟩ ≍ k^{−2(r+1)} and the ⟨ψ_i, Cψ_j⟩ satisfy the decay conditions (4.0.3). Suppose further that τ → 0 and m → ∞ in such a way that m^{−1} = o(τ^{1/(r+1)}). Then it follows that γ(Q_ξ) ≲ τ^{1/(r+1)}.

This allows us to establish the following theorem for the purpose of bounding (4.0.1).

Theorem 4.0.3. Suppose that τ^2 → 0 and n, m → ∞ so that τ^2 ≍ m^{−1} and m^ς ≤ n ≤ m^υ for ς, υ > 0. Then for δ = (12κ log n log m / n)^{1/2} and with the assumptions listed above on the correlations between the ξ_k, it follows that on a set of probability at least 1 − O(n^{−2}) the terms bounding the E_{f_i}||f̂_i − f_i||^2_{L^2} satisfy the following inequalities:

i.) max_{i ≤ n} R_i ≤ (1 + o(1)) Σ_{k=1}^∞ λ_k τ^2/(λ_k + τ^2) ≍ m^{−(2r+1)/(2r+2)}.

ii.) max_{i ≤ n} P_i^{1/2}(A_δ) ≤ exp(o(1)) m^{−κ log n + 1/2}.

iii.) max_{i ≤ n} e_i ≤ 6 log(nm) [ C_δ Σ_{k=1}^∞ min(λ_k, δτ^2) + C_τ (1 + E||f||_2^2) max_{i ≤ n} P_i^{1/2}(A_δ) ] ≲ log(mn) [ δ^{(2r+1)/(2r+2)} m^{−(2r+1)/(2r+2)} + m^{−κ log n + 1/2} ].

iv.) max_{i ≤ n} Σ_{k>m} ξ_{i,k}^2 = (1 + o(1)) Σ_{k>m} λ_k ≲ m^{−(2r+1)}.

Given the assumptions, δ^{(2r+1)/(2r+2)} log(nm) = o(1) and also m^{−κ log n + 1/2} log(mn) = o(m^{−(2r+1)/(2r+2)}). Thus as m, n → ∞,

    max_{i ≤ n} E_{f_i}||f̂_i − f_i||_2^2 = (1 + o_{a.s.}(1)) Σ_{k=1}^∞ λ_k τ^2/(λ_k + τ^2) ≍ m^{−(2r+1)/(2r+2)},

which is the oracle rate.

Before the proof, some remarks. This rate is known to be optimal for L^2 reconstruction of f, in the sense of averaged error, when one has m noisy samples of f at arbitrary points in [0, 1]. See e.g. [38], Chapter 5, Proposition 3. Whereas the optimal estimator in [38] relies on knowing C, and thus on an oracle's knowledge of the underlying model, we are able to achieve this rate from the data alone. Further, conditionally on the observed trajectories, our rates are shown to hold with high probability, which provides sample to sample guarantees that averaged bounds do not. It is not totally transparent from iii.) of Theorem 4.0.3, but there is a tradeoff between the terms δ^{(2r+1)/(2r+2)} m^{−(2r+1)/(2r+2)} and m^{−κ log n + 1/2}. Larger δ puts more weight on the factor m^{−(2r+1)/(2r+2)} and shrinks m^{−κ log n + 1/2} by placing a larger κ in

39 Chapter 4. Minimax Recovery of Functional Data 31 the exponent. Ideally we would like to balance the two terms which happens when we choose δ as small as possible. Taking κ = 3/(2 log n) gives δ = (18 log m/n) 1/2 and m κ log n+1/2 = m 1 = o(m (2r+1)/(2r+2) ), which is about as small as we can take δ without knowing anything about r. In fact with slightly worse constants of proportionality (which still have the order 1±δ), the probabilities used in our proofs concentrate like exp( nδ 2 /4) which means the m κ log n+1/2 3κ log n/2+1/2 should look like m and this justifies taking δ = (12 log m/n) 1/2 for practical implementation. Proof of theorem Let the event D α be defined as D α = { } max ζ i,m > α T r(q ξ ). i n Then choosing α so that cα = 3τ ϕ/2 log n, lemma gives P (D α ) 2n 2. Now λ k τ 4 /(λ k + τ 2 ) 2 λ k τ 2 /(λ k + τ 2 ) gives that T r(q ζ ) m k=1 λ kτ 2 /(λ k + τ 2 ) and so on D c α, we have max i n R i (1 + α) n k=1 λ k τ 2 λ k + τ 2 which satisfies the desired bound as long as α = o(1) as m. In the case that τ 2 m 1 this happens as long as n is bounded by a power of m, which holds by assumption. For γ > 2, on an event of probability 1 (nm) 1 γ2 /2, it holds that max i n,k m λ 1/2 k ξ ik γ(log(nm)) 1/2. In this case we find that the P i (A k,δ ) satisfy P i (A k,δ ) 8 exp(δγ 2 log(nm)/2) exp( nδ 2 /6). By assumption δ log(nm) n 1/2 log n = o(1) and so taking δ = (12 log n/n) 1/2 gives

the quoted bound for the P_i^{1/2}(A_{k,δ}). Also, on this event, the bound for the e_{i,k} from Theorem 3.0.1 reduces to

    e_ik = γ^2 log(nm) [ C_δ min(λ_k, δτ^2) + C P_i^{1/2}(A_δ)(λ_k + τ^2) ].

This sums to yield the bound of (iii). Finally, we take E_α to be the event

    E_α = { max_{i ≤ n} |η_{i,m}| > α Tr(Q_η) },

and notice that, by Lemma 7.0.4, choosing α so that cα = 3m^{−1/2} log n gives P(E_α) ≤ 2n^{−2}. By assumption α ≲ n^{−1/(2γ^2)} log n = o(1) and so on E_α^c the third bound of the theorem holds. A union bound gives that the bounds of the theorem hold with probability at least 1 − 4n^{−2} − (nm)^{1−γ^2/2}, which matches the quoted probability for γ^2 ≥ 6. We now shift focus to constructing a collection of parameter spaces over which this estimation procedure has a minimax interpretation. With a basis {ψ_k}_{k=1}^∞ satisfying the covariance conditions (4.0.3) and m, n as in Theorem 4.0.3, we define a collection of parameter spaces, F_{m,n}, by

    F_{m,n} = { f ∈ L^2 : |⟨f, ψ_k⟩| ≤ a(λ_k log(mn))^{1/2} for k ≤ m and |⟨f, ψ_k⟩| ≤ b(λ_k log(nk))^{1/2} for k > m },   (4.0.4)

with a, b to be determined. The idea is to find the smallest a, b guaranteeing that f_1, ..., f_n ∈ F_{m,n} as m, n → ∞, thus keeping the collection as small as possible.
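As a small illustration of how the constant a in (4.0.4) behaves, the following sketch (illustrative settings, with the coefficients drawn independently) simulates Gaussian coefficients with variances λ_k and records the smallest a for which the constraint |⟨f, ψ_k⟩| ≤ a(λ_k log(mn))^{1/2} holds for all i ≤ n and k ≤ m. Since the maximum of nm standard normals grows like (2 log(nm))^{1/2}, this value should remain bounded, roughly near √2.

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, r = 500, 256, 1
    lam = np.arange(1, m + 1, dtype=float) ** (-2.0 * (r + 1))
    xi = rng.normal(size=(n, m)) * np.sqrt(lam)          # xi_{ik} ~ N(0, lambda_k)

    a_min = np.max(np.abs(xi) / np.sqrt(lam * np.log(m * n)))
    print(a_min)    # smallest a placing all n simulated coefficient sequences inside (4.0.4) for k <= m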


More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Learning gradients: prescriptive models

Learning gradients: prescriptive models Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan

More information

Optimal Estimation of a Nonsmooth Functional

Optimal Estimation of a Nonsmooth Functional Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose

More information

D I S C U S S I O N P A P E R

D I S C U S S I O N P A P E R I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S ( I S B A ) UNIVERSITÉ CATHOLIQUE DE LOUVAIN D I S C U S S I O N P A P E R 2014/06 Adaptive

More information

Asymptotically sufficient statistics in nonparametric regression experiments with correlated noise

Asymptotically sufficient statistics in nonparametric regression experiments with correlated noise Vol. 0 0000 1 0 Asymptotically sufficient statistics in nonparametric regression experiments with correlated noise Andrew V Carter University of California, Santa Barbara Santa Barbara, CA 93106-3110 e-mail:

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

Wavelet Shrinkage for Nonequispaced Samples

Wavelet Shrinkage for Nonequispaced Samples University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Wavelet Shrinkage for Nonequispaced Samples T. Tony Cai University of Pennsylvania Lawrence D. Brown University

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Bayesian Nonparametric Point Estimation Under a Conjugate Prior

Bayesian Nonparametric Point Estimation Under a Conjugate Prior University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 5-15-2002 Bayesian Nonparametric Point Estimation Under a Conjugate Prior Xuefeng Li University of Pennsylvania Linda

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Gaussian Processes. 1. Basic Notions

Gaussian Processes. 1. Basic Notions Gaussian Processes 1. Basic Notions Let T be a set, and X : {X } T a stochastic process, defined on a suitable probability space (Ω P), that is indexed by T. Definition 1.1. We say that X is a Gaussian

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad 40.850: athematical Foundation of Big Data Analysis Spring 206 Lecture 8: inimax Lower Bounds: LeCam, Fano, and Assouad Lecturer: Fang Han arch 07 Disclaimer: These notes have not been subjected to the

More information

Nonparametric Inference In Functional Data

Nonparametric Inference In Functional Data Nonparametric Inference In Functional Data Zuofeng Shang Purdue University Joint work with Guang Cheng from Purdue Univ. An Example Consider the functional linear model: Y = α + where 1 0 X(t)β(t)dt +

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

Stochastic Spectral Approaches to Bayesian Inference

Stochastic Spectral Approaches to Bayesian Inference Stochastic Spectral Approaches to Bayesian Inference Prof. Nathan L. Gibson Department of Mathematics Applied Mathematics and Computation Seminar March 4, 2011 Prof. Gibson (OSU) Spectral Approaches to

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

1.5 Approximate Identities

1.5 Approximate Identities 38 1 The Fourier Transform on L 1 (R) which are dense subspaces of L p (R). On these domains, P : D P L p (R) and M : D M L p (R). Show, however, that P and M are unbounded even when restricted to these

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University WHOA-PSI Workshop, St Louis, 2017 Quotes from Day 1 and Day 2 Good model or pure model? Occam s razor We really

More information

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich Submitted to the Annals of Applied Statistics arxiv: math.pr/0000000 THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES By Sara van de Geer and Johannes Lederer ETH Zürich We study high-dimensional

More information

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior 3. Some tools for the analysis of sequential strategies based on a Gaussian process prior E. Vazquez Computer experiments June 21-22, 2010, Paris 21 / 34 Function approximation with a Gaussian prior Aim:

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Convex Optimization Notes

Convex Optimization Notes Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Asymptotic Equivalence and Adaptive Estimation for Robust Nonparametric Regression

Asymptotic Equivalence and Adaptive Estimation for Robust Nonparametric Regression Asymptotic Equivalence and Adaptive Estimation for Robust Nonparametric Regression T. Tony Cai 1 and Harrison H. Zhou 2 University of Pennsylvania and Yale University Abstract Asymptotic equivalence theory

More information

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1 Lectures - Week 11 General First Order ODEs & Numerical Methods for IVPs In general, nonlinear problems are much more difficult to solve than linear ones. Unfortunately many phenomena exhibit nonlinear

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

A Statistical Analysis of Fukunaga Koontz Transform

A Statistical Analysis of Fukunaga Koontz Transform 1 A Statistical Analysis of Fukunaga Koontz Transform Xiaoming Huo Dr. Xiaoming Huo is an assistant professor at the School of Industrial and System Engineering of the Georgia Institute of Technology,

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

New Statistical Applications for Differential Privacy

New Statistical Applications for Differential Privacy New Statistical Applications for Differential Privacy Rob Hall 11/5/2012 Committee: Stephen Fienberg, Larry Wasserman, Alessandro Rinaldo, Adam Smith. rjhall@cs.cmu.edu http://www.cs.cmu.edu/~rjhall 1

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Asymptotic efficiency of simple decisions for the compound decision problem

Asymptotic efficiency of simple decisions for the compound decision problem Asymptotic efficiency of simple decisions for the compound decision problem Eitan Greenshtein and Ya acov Ritov Department of Statistical Sciences Duke University Durham, NC 27708-0251, USA e-mail: eitan.greenshtein@gmail.com

More information

Karhunen-Loève decomposition of Gaussian measures on Banach spaces

Karhunen-Loève decomposition of Gaussian measures on Banach spaces Karhunen-Loève decomposition of Gaussian measures on Banach spaces Jean-Charles Croix jean-charles.croix@emse.fr Génie Mathématique et Industriel (GMI) First workshop on Gaussian processes at Saint-Etienne

More information

Functional Latent Feature Models. With Single-Index Interaction

Functional Latent Feature Models. With Single-Index Interaction Generalized With Single-Index Interaction Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University Naisyin Wang and

More information

L p -boundedness of the Hilbert transform

L p -boundedness of the Hilbert transform L p -boundedness of the Hilbert transform Kunal Narayan Chaudhury Abstract The Hilbert transform is essentially the only singular operator in one dimension. This undoubtedly makes it one of the the most

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Some functional (Hölderian) limit theorems and their applications (II)

Some functional (Hölderian) limit theorems and their applications (II) Some functional (Hölderian) limit theorems and their applications (II) Alfredas Račkauskas Vilnius University Outils Statistiques et Probabilistes pour la Finance Université de Rouen June 1 5, Rouen (Rouen

More information

Sparse Legendre expansions via l 1 minimization

Sparse Legendre expansions via l 1 minimization Sparse Legendre expansions via l 1 minimization Rachel Ward, Courant Institute, NYU Joint work with Holger Rauhut, Hausdorff Center for Mathematics, Bonn, Germany. June 8, 2010 Outline Sparse recovery

More information

Minimax Estimation of Kernel Mean Embeddings

Minimax Estimation of Kernel Mean Embeddings Minimax Estimation of Kernel Mean Embeddings Bharath K. Sriperumbudur Department of Statistics Pennsylvania State University Gatsby Computational Neuroscience Unit May 4, 2016 Collaborators Dr. Ilya Tolstikhin

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Modeling with Itô Stochastic Differential Equations

Modeling with Itô Stochastic Differential Equations Modeling with Itô Stochastic Differential Equations 2.4-2.6 E. Allen presentation by T. Perälä 27.0.2009 Postgraduate seminar on applied mathematics 2009 Outline Hilbert Space of Stochastic Processes (

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

A Lower Bound Theorem. Lin Hu.

A Lower Bound Theorem. Lin Hu. American J. of Mathematics and Sciences Vol. 3, No -1,(January 014) Copyright Mind Reader Publications ISSN No: 50-310 A Lower Bound Theorem Department of Applied Mathematics, Beijing University of Technology,

More information

Universal examples. Chapter The Bernoulli process

Universal examples. Chapter The Bernoulli process Chapter 1 Universal examples 1.1 The Bernoulli process First description: Bernoulli random variables Y i for i = 1, 2, 3,... independent with P [Y i = 1] = p and P [Y i = ] = 1 p. Second description: Binomial

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information Introduction Consider a linear system y = Φx where Φ can be taken as an m n matrix acting on Euclidean space or more generally, a linear operator on a Hilbert space. We call the vector x a signal or input,

More information

Kernels A Machine Learning Overview

Kernels A Machine Learning Overview Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Inverse Statistical Learning

Inverse Statistical Learning Inverse Statistical Learning Minimax theory, adaptation and algorithm avec (par ordre d apparition) C. Marteau, M. Chichignoud, C. Brunet and S. Souchet Dijon, le 15 janvier 2014 Inverse Statistical Learning

More information

Decoupling Lecture 1

Decoupling Lecture 1 18.118 Decoupling Lecture 1 Instructor: Larry Guth Trans. : Sarah Tammen September 7, 2017 Decoupling theory is a branch of Fourier analysis that is recent in origin and that has many applications to problems

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Lecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differenti. equations.

Lecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differenti. equations. Lecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differential equations. 1 Metric spaces 2 Completeness and completion. 3 The contraction

More information

Distance between multinomial and multivariate normal models

Distance between multinomial and multivariate normal models Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating

More information

Submitted to the Brazilian Journal of Probability and Statistics

Submitted to the Brazilian Journal of Probability and Statistics Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a

More information

Brownian Motion. Chapter Stochastic Process

Brownian Motion. Chapter Stochastic Process Chapter 1 Brownian Motion 1.1 Stochastic Process A stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ,P and a real valued stochastic

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information