
Simultaneous White Noise Models and Optimal Recovery of Functional Data

by

Mark Koudstaal

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Statistics
University of Toronto

© Copyright 2015 by Mark Koudstaal

Abstract

Simultaneous White Noise Models and Optimal Recovery of Functional Data
Mark Koudstaal
Doctor of Philosophy
Graduate Department of Statistics
University of Toronto
2015

We consider i.i.d. realizations of a Gaussian process on [0, 1] satisfying prescribed regularity conditions. The data consist of discrete samplings of these realizations in i.i.d. Gaussian noise and the goal is estimation of the underlying trajectories. Further, we want our estimates to enjoy expected L^2 errors, conditioned on the realized trajectories, which attain optimal rates. Under general conditions on both design and process, an asymptotic equivalence, in Le Cam's sense, is established between an experiment which simultaneously describes these realizations and a collection of white noise models. The risk properties of our estimation goal may then be studied in an idealized setting and benchmarks established for practically implementable procedures. In this context, the white noise models are projected onto a basis satisfying general conditions in relation to the covariance kernel of the process which generated the data. This reduces the problem of initial interest to that of recovering a collection of normal means in Euclidean norm, the means of interest having Gaussian structure. A variant of Stein estimation is applied for recovery of these means and a key inequality is derived showing that the corresponding risks, conditioned on the underlying means, can be made arbitrarily close to those that an oracle with knowledge of the process would attain. This establishes various notions of optimality for our recovery procedure.

Finally, guarantees are derived for practically implementable variants and empirical performance is illustrated through simulated and real data examples.

Acknowledgements

This thesis owes its existence to a fantastic network of friends and teachers. I'm incredibly grateful to have had such a thoughtful and supportive advisor in Fang, who devoted a huge amount of time to the development of this work with patient, enthusiastic and encouraging scientific guidance. He has been generous with and confident in me from the outset and I will always value the gift of being given the confidence to take this project in directions that caught my interest. My heartfelt thanks go to my advisory committee for taking the time to listen to rough progress reports on this work, for setting me at ease in committee meetings with kind, engaged, thoughtful listening and for encouraging suggestions that have improved and diversified the scope of things. Special thanks to Andrey for generously nurturing my interest in wavelets! Words can't express my thanks for my parents. They've made everything possible, been fantastic friends along the way and I'm lucky to have them in my life. A champion's medal goes to Ali for sharing in the ups and downs of the whole process and the uncertainty of it all. I'm so lucky to have such a great friend in my partner.

Contents

1 Introduction
2 White Noise Equivalence for Functional Data
   Le Cam Equivalence
   White Noise Equivalence for Nonparametric Data
   Functional Data and White Noise Representations
3 Stein Estimation and Oracle Inequalities
4 Minimax Recovery of Functional Data
5 Implementable Procedures with Rigorous Guarantees
   Estimation of σ From Quantiles
   Extension to Imperfect Data Models
   Guarantees for Recovery on Equispaced Design
   Suggested Directions for Random Design
6 Numerical Experiments
   Simulation studies
   Recovery of Normalized Stock Prices
7 Appendix

Bibliography

List of Tables

6.1 In Sample Recovery. Results are blown up by a factor of 10^3 to preserve space.
6.2 Out of Sample Recovery. Results are blown up by a factor of 10^3 to preserve space.
6.3 Errors calculated on training data. Wavelet and Fourier estimates are those corresponding to the minimum quantile estimates. Results are blown up by a factor of 10^4 to preserve space.

List of Figures

6.1 This plot compares the oracle threshold weights λ_k/(λ_k + τ^2) (blue), for covariance function C_4, with the empirically estimated threshold weights α_{mn,k} for σ (red) and σ̂ (green). As can be seen from the graph, both sets of empirical weights are more conservative than the oracle. This results directly from the fact that the inflation factors (1 + Cδ) used in the α_{mn,k} guarantee that, with high probability, α_{mn,k} ≤ λ_k/(λ_k + τ^2) simultaneously for k = 1, ..., m.
6.2 The blue line plots recovery error for the estimate f̂(α), which uses σ̂^2(α) to construct the α_{mn,k}, against α. The red line is the error corresponding to the estimate f̂(1/m), which lies below the entire error curve, suggesting that this value is optimal.
6.3 A sample function plotted alongside the corresponding oracle and σ̂^2(1/m) estimates in wavelet bases.
6.4 A sample function plotted alongside the corresponding oracle and σ̂^2(1/m) estimates in wavelet bases.

Chapter 1

Introduction

Functional Data Analysis (FDA) is maturing to encompass a host of techniques and applications in nonparametric statistics. The monographs of Ramsay and Silverman [35, 36] provide overviews and numerous examples. We begin with a setup drawn from FDA: we assume we have a collection of functions f_1, ..., f_n, i.i.d. realizations of f, a mean 0 Gaussian distribution over a sufficiently regular function class F ⊂ L^2[0, 1] satisfying E||f||^2 < ∞. The f_i are observed intermittently and with noise, so that our data consist of

    y_ij = f_i(x_ij) + z_ij,   z_ij i.i.d. N(0, 1),   for i = 1, ..., n, j = 1, ..., m,   (1.0.1)

for design points x_ij ∈ [0, 1]. Our interest is in modelling the risks of recovering the f_i from (1.0.1). The covariance kernel, C(s, t) = E f(s)f(t), induces an integral operator C : L^2[0, 1] → L^2[0, 1] through the action

    Cg = ∫_0^1 C(·, s) g(s) ds

and the corresponding orthonormal eigenbasis, {ψ_k}_{k=1}^∞, ordered with decreasing eigenvalues, is known as the Karhunen-Loève (KL) basis. This basis is known to be optimal for linear recovery in that the expected L^2[0, 1] approximation errors,

    E(N) = E||f − Σ_{p=1}^N ⟨f, ψ_p⟩ ψ_p||^2 = Σ_{p>N} ⟨Cψ_p, ψ_p⟩,

are smaller in the KL basis, for every N, than in any other orthonormal basis. The bulk of related literature focuses on using (1.0.1) to estimate the eigenstructure of C and the prevailing philosophy is that recovery using quality estimates of this basis will be optimal. Additionally, other procedures of interest to FDA are often performed optimally in this basis. Under the assumption of densely sampled data, a common approach is to pre-smooth curves and use the resulting estimates to approximate the covariance structure. This can then be used for expansion of the underlying process. This type of approach has been analyzed in Cardot [12] under the assumptions of a finite dimensional underlying process and equispaced sampling. The PACE procedure, put forward in Yao et al. [45], introduced a clever smoothing procedure for estimation of the eigenstructure of C and proposed a novel conditional expectation approach for recovery of the f_i in the estimated basis. Further, this procedure enjoys robustness against sparse and irregularly sampled data common to longitudinal studies. Smoothness and the operation of sampling are naturally expressed in the language of Reproducing Kernel Hilbert Space (RKHS) and this structure has recently been exploited to develop unique procedures relevant to our problem. In Cai and Yuan [7] a smoothing spline approach is taken for recovery of C from (1.0.1), under relaxed assumptions on the design points, and the procedure is shown to come within logarithmic factors of optimality under a matching of the smoothness of the process to the kernel used for recovery. With a finite dimensional assumption on the

underlying process, Amini and Wainwright [1] develop a novel method for extending recovery of C to noisy sampling schemes in a fairly general class of domains and prove optimality properties. Nevertheless, the primary focus of these techniques is recovery of C. Further, beyond the finite dimensional setting and the notion of consistency, their performance and properties regarding recovery of the f_i in more general settings remain unclear. Techniques from nonparametric function estimation might be brought to bear, but treating (1.0.1) as a collection of separate recovery problems would neglect potentially useful information from the distributional structure of the data. The complicated nature of the data obscures this intuition, and we would like to study the problem in a simplified setting with potential for broad insights. A useful tool, allowing the simplification of statistical problems, is Le Cam's notion of asymptotic equivalence [29]. It is often the case that a complex statistical model, over a given parameter space Θ, is asymptotically equivalent to a simpler model over Θ. In this case both models share identical rates under all bounded loss functions and one may reduce a study of the complicated model to a proof of asymptotic equivalence and a corresponding analysis of the simpler model. In particular, asymptotic equivalence allows one to address questions about complicated issues such as minimaxity in a simplified setting. The white noise model,

    Y(dt) = f(t) dt + m^{-1/2} B(dt),   t ∈ [0, 1],   (1.0.2)

with B a standard Brownian motion, has had a profound impact on the study of many problems in nonparametric function estimation through the notion of asymptotic equivalence. On the one hand, work as in Brown and Low [3], Brown et al. [4] and Reiß [37] has established that this model is asymptotically equivalent to the classical

nonparametric regression experiment

    y_j = f(x_j) + z_j,   z_j i.i.d. N(0, 1),   j = 1, ..., m,   (1.0.3)

for various collections of functions F and sampling designs for x_j ∈ [0, 1] which grow dense as m → ∞. On the other hand, given an orthonormal basis {ψ_k}_{k=1}^∞ of L^2[0, 1], one may decompose the white noise model into a collection of normal means. Basic properties of Ito integration give

    Y(ψ_k) = ∫_0^1 ψ_k(t) Y(dt) = θ_k + m^{-1/2} z_k,   z_k i.i.d. N(0, 1),   k = 1, 2, ...,   (1.0.4)

with θ_k = ⟨f, ψ_k⟩_{L^2}. Further, given estimates θ̂_k of the θ_k, one may form an estimate f̂ of f from f̂ = Σ_k θ̂_k ψ_k, and isometry yields

    E||f̂ − f||^2_{L^2} = Σ_{k=1}^∞ E(θ̂_k − θ_k)^2.   (1.0.5)

In this way, we see that estimation in (1.0.3) under the MISE metric may be translated via the white noise model into an infinite dimensional normal mean recovery problem under l^2 risk. Various collections of functions of practical interest, F, have natural characterizations in terms of geometric constraints on their Fourier coefficients θ_k in an appropriate basis, and reduction to (1.0.4) can considerably simplify the problem of characterizing the minimax risk

    R_m(F) = inf_{f̂} sup_{f ∈ F} E||f̂ − f||^2_{L^2}.   (1.0.6)

See Pinsker [34], Donoho [19] and Donoho et al. [21, 22]. Further, construction of

estimators which attain the minimax rate

    sup_{f ∈ F} E||f̂ − f||^2_{L^2} ≍ R_m(F)

over broad classes of functions, F, can be much simpler in the framework of (1.0.4), while providing indications on how to proceed in the general setting of (1.0.3). See Candes [11], Cai [6] and Johnstone [26]. We propose that similar insights may be won in the context of FDA by employing the white noise model there. Let B_i, i = 1, ..., n, be independent Brownian motions. In the body of the paper we will justify studying risk in the recovery problem (1.0.1) in the simplified context of the collection of white noise models

    Y_i(dt) = f_i(t) dt + m^{-1/2} B_i(dt),   for i = 1, ..., n, t ∈ [0, 1],   (1.0.7)

by proving asymptotic equivalence over a class of functions, F_{m,n}, growing in m and n, which almost surely contains the f_i. Then, fixing an orthonormal basis {ψ_k}_{k=1}^∞, the synthesis of (1.0.4) reduces the study of MISE in (1.0.7) to the study of the infinite dimensional random effects model

    Y_i(ψ_k) = ξ_ik + m^{-1/2} z_ik,   z_ik i.i.d. N(0, 1),   k ∈ N, i = 1, ..., n,   (1.0.8)

under l^2 error. Here assumptions guarantee that the ξ_ik = ∫_0^1 f_i ψ_k are realizations of mean 0 normal variables with covariance structure E ξ_il ξ_jm = δ_ij ⟨ψ_l, Cψ_m⟩, where δ_ij = 1(i = j). Let λ_k = Var(ξ_ik) = ⟨ψ_k, Cψ_k⟩. In the case of zero correlation the Wiener filter, which recovers the f_i from Fourier coefficient estimates

    ξ̂_ik = λ_k/(λ_k + m^{-1}) Y_i(ψ_k),   (1.0.9)

is optimal, with risk behaving like

    E||f̂_i − f_i||^2_{L^2} ≍ Σ_{k=1}^∞ min(λ_k, m^{-1}).   (1.0.10)

This is minimized when {ψ_k}_{k ∈ N} is taken to be the Karhunen-Loève basis. Nevertheless, as long as the λ_k decay in the same way as the Karhunen-Loève eigenvalues, this rate will remain unchanged, even in the presence of correlation among the ξ_ik. For this reason, we expect Wiener filtering to perform near optimally for a broad range of bases. Unfortunately this is an oracle estimation strategy and we will not, in general, know the variance structure needed to employ it. Classically, oracle estimators of similar form to (1.0.9) have been mimicked by information pooling. Fourier coefficients are grouped into blocks with coefficients of similar size and clever estimation strategies employed to mimic the oracle, see e.g. Cai [5], Zhang [47]. The setup of our problem provides an opportunity to expand on this in a novel way. We do this by using repeated observations, as opposed to frequency blocking, as a means of information pooling. In particular we block Fourier coefficients across individuals and perform Stein type thresholding. An oracle inequality for the conditional risks of estimating the ξ_ik by this method is derived and used to show that the E_{f_i}||f̂_i − f_i||^2_{L^2} attain the rate (1.0.10) with high probability. Here E(·) represents full expectation while E_{f_i}(·) = E(· | f_i) represents expectation conditioned on the realized trajectories. We then draw a connection to the classical theory by considering a smallest collection of restricted parameter spaces F_{m,n} containing the f_1, ..., f_n almost surely as m, n → ∞. First, we are able to show that

    max_{i ≤ n} E_{f_i}||f̂_i − f_i||^2_{L^2} = o_{a.s.}(R_m(F_{m,n}))

and so the f_i are, in a sense, simultaneously recovered in a superefficient manner. Further, for any f ∈ F_{m,n} we may consider the white noise experiment (1.0.2). In relation to this parameter space, we show that the estimator f̂ = Σ_{k=1}^∞ α_{mn,k} Y(ψ_k) ψ_k, where the α_{mn,k} are the threshold estimates derived from our strategy, is nearly asymptotically rate adaptive, with

    sup_{f ∈ F_{m,n}} E||f̂ − f||^2_{L^2} ≤ (1 + o(1)) (log(mn))^{(2r+1)/(2r+2)} R_m(F_{m,n})

when λ_k ≍ k^{-2(r+1)}. Thus, in a very real sense this is a near minimax strategy which may have applications beyond recovery of the f_i. Recently, similar strategies have been employed successfully in the context of FDA. Meister [32] introduced white noise approximations to study the Functional Linear Model (FLM). In his paper, Meister was able to show that the FLM is asymptotically equivalent to an inverse problem in white noise

    Y(dt) = [C^{1/2} θ](t) dt + n^{-1/2} B(dt),   t ∈ [0, 1],

and use this to provide a relatively simple characterization of minimax rates for estimation of θ. In Lei [30] this framework has been applied to the problem of testing H_0: θ = 0 vs. H_1: θ ≠ 0 for the FLM and we expect other applications to follow.
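To fix ideas, here is a minimal numerical sketch of the reduction just described: it simulates the random effects model (1.0.8) and applies the oracle Wiener filter (1.0.9). The eigenvalue decay, dimensions and sample sizes are illustrative assumptions made only for this example (not settings taken from the thesis), and the basis is treated as the Karhunen-Loève basis so that the coefficients are uncorrelated.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes and smoothness: n curves, noise level tau^2 = 1/m, K retained frequencies.
    n, m, K, r = 50, 256, 100, 1
    tau2 = 1.0 / m
    lam = np.arange(1, K + 1, dtype=float) ** (-2.0 * (r + 1))   # assumed lambda_k ~ k^{-2(r+1)}

    # True Fourier coefficients xi_{ik} ~ N(0, lambda_k), independent across k (KL basis case).
    xi = rng.normal(size=(n, K)) * np.sqrt(lam)

    # Observations in the sequence model (1.0.8): Y_i(psi_k) = xi_{ik} + m^{-1/2} z_{ik}.
    Y = xi + np.sqrt(tau2) * rng.normal(size=(n, K))

    # Oracle Wiener filter (1.0.9): shrink each coefficient by lambda_k / (lambda_k + tau^2).
    xi_hat = Y * (lam / (lam + tau2))

    # By the isometry (1.0.5), each curve's L^2 error equals the squared l^2 error of its coefficients.
    per_curve_err = np.sum((xi_hat - xi) ** 2, axis=1)
    oracle_benchmark = np.sum(lam * tau2 / (lam + tau2))   # the quantity behind (1.0.10)

    print(per_curve_err.mean(), oracle_benchmark)

Averaged over the n curves, the realized errors should track the oracle benchmark Σ_k λ_k τ^2/(λ_k + τ^2); the blocked Stein procedure of Chapter 3 aims to match this without knowing the λ_k.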

Chapter 2

White Noise Equivalence for Functional Data

2.1 Le Cam Equivalence

We begin by discussing the notion of Le Cam equivalence employed in Brown and Low [3] and Reiß [37], following an amalgam of notation from these sources. For thorough overviews, refer to Le Cam and Lo Yang [29], Le Cam [28]. In the abstract, we assume that we have two sequences of experiments

    E_{m,n} = {(X_1^m, B_1^m, P_{1,θ}^m), θ ∈ Θ_n}   and   G_{m,n} = {(X_2^m, B_2^m, P_{2,θ}^m), θ ∈ Θ_n}.

The experiments consist of standard probability triples, changing in m and indexed by an identical parameter space Θ_n, possibly changing with the index n. Although the parameter space is taken to be the same for the two experiments, the probability triples may be entirely different. We assume that we make decisions about θ ∈ Θ_n based on the outcome of the experiments. These decisions, δ_i : X_i^m → A, are assumed to take values in a common

action space A and we measure their quality with loss functions L = L_n : Θ_n × A → [0, ∞) through the corresponding risk

    R_i(δ_i, L, θ) = E_{P_{i,θ}^m} L(δ_i, θ).

(In the case of a randomized decision rule, δ = δ(x, ·) is a probability measure on the action space A and we take L(δ, θ) = ∫_A L(a, θ) δ(x, da).) With pseudo-norm ||L|| = ||L_n|| = sup{L_n(θ, a) : θ ∈ Θ_n, a ∈ A}, Le Cam's notion of equivalence, as employed in the references listed above, says these experiments are asymptotically equivalent if the distance

    Δ(E_{m,n}, G_{m,n}) := max[ sup_{δ_2} inf_{δ_1} sup_{θ ∈ Θ_n, ||L|| = 1} |R_1(δ_1, L, θ) − R_2(δ_2, L, θ)|,
                               sup_{δ_1} inf_{δ_2} sup_{θ ∈ Θ_n, ||L|| = 1} |R_1(δ_1, L, θ) − R_2(δ_2, L, θ)| ],   (2.1.1)

tends to 0 as m, n → ∞. One of the salient features of Le Cam equivalence is the implication that for any decision rule δ_1 in the first experiment, there is a corresponding rule δ_2 in the other, so that

    sup_{θ ∈ Θ_n, ||L|| = 1} |R_1(δ_1, L, θ) − R_2(δ_2, L, θ)| = o(1),

and vice versa. Since this holds for all θ ∈ Θ_n and bounded L, simultaneously, we see that we may learn about the risk of estimation in one experiment by studying it in the context of the other. The Le Cam distance is a challenging quantity to bound. The standard route in the current context is to construct a sufficient statistic which maps the sample space of one experiment to that of the other. As these preserve information, the corresponding experiment has 0 Le Cam distance from the original one and so the triangle inequality reduces the problem to comparing experiments on the same sample space. This is a simpler problem for which bounds exist in terms of well known probability metrics

such as total variation distance.

2.2 White Noise Equivalence for Nonparametric Data

In the context of nonparametric regression, Brown et al. [4], Brown and Low [3] and Reiß [37], among others, derive equivalence of nonparametric regression to the white noise model for various sampling designs. In particular, a function space Θ_n ⊂ L^2[0, 1] is assumed and, with the notation of the previous section, E_{m,n} is taken to consist of the probability spaces generated by the nonparametric experiment (1.0.3) as f ranges over Θ_n. Various designs x_i may be considered, but the basic requirement is that the design grow dense in [0, 1] as m is increased. In the cited references, the design is either taken to be random and uniform on [0, 1] or near equi-spaced in the sense that max_i |x_i − x_{i−1}| = o(1), at some regularity determined rate. These functions also generate the collection of probability spaces corresponding to the diffusions (1.0.2) and we will denote the corresponding experiment by G_{m,n}. In [37], Reiß establishes a bound on the Le Cam distance, Δ(E_{m,n}, G_{m,n}), of

    Δ(E_{m,n}, G_{m,n}) ≲ m^{1/2} sup_{f ∈ Θ_n} ||f − I_m f||_{L^2},

where I_m f is a projection of f into a design dependent interpolation space. He then proceeds to find bounds on this distance for a number of standard function classes used in nonparametric regression. Here we will be interested in bounds for growing Sobolev balls

    Θ_n = F_S(s, R_n) := {f ∈ H^s([0, 1]) : ||f||_{H^s} ≤ R_n}   (2.2.1)

with R_n → ∞, H^s([0, 1]) denoting the Sobolev space

    H^s([0, 1]) = {f : f, f^{(1)}, ..., f^{(s−1)} ∈ C([0, 1]) and f^{(s)} ∈ L^2([0, 1])},   (2.2.2)

    ||f||^2_{H^s} = ||f||^2_{L^2} + ||f^{(s)}||^2_{L^2},

and s ∈ N. For these spaces and equi-spaced design, Reiß found the bound

    Δ(E_{m,n}, G_{m,n}) ≲ m^{1/2−s} R_n.   (2.2.3)

For s > 1/2 and R_n → ∞ slowly enough, this allows us to establish Le Cam equivalence between E_{m,n} and G_{m,n} for the growing sequence of Sobolev balls, Θ_n.

2.3 Functional Data and White Noise Representations

We reiterate the data setup and elaborate on assumptions. Functions f_1, ..., f_n are assumed to be independent realizations of a mean 0 Gaussian distribution over H^s([0, 1]). Further, the covariance kernel C(s, t) = E f(s)f(t) is assumed to satisfy the Sacks-Ylvisaker conditions of order r ≥ s (see 7.0.1). In this context, our data take the form of noisy and intermittent observations on the f_i as in (1.0.1). We wish to use the white noise models (1.0.7) to study the risk of estimating the f_i from (1.0.1) in the situation where the design is equi-spaced, x_ij = j/m, j = 1, ..., m. In this direction, we have the following theorem.

Theorem 2.3.1. Assume, in addition to the conditions listed above, that the integral operator corresponding to C may be simultaneously diagonalized with the integral operator corresponding to the reproducing kernel of H^s([0, 1]). Suppose further that m, n → ∞ in such a way that m^{1/2−s}(log n)^{1/2} → 0. Then for any estimation strategy

δ_2 in (1.0.7), there is a corresponding estimation strategy δ_1 in (1.0.1) so that

    sup_{i ≤ n, ||L|| = 1} |R_1(δ_1, L, f_i) − R_2(δ_2, L, f_i)| = o_{a.s.}(1).

Consequently, we may study the risk of recovering the f_i from (1.0.1) in the context of (1.0.7). The proof of this theorem relies on the following lemma.

Lemma 2.3.2. Suppose K is the reproducing kernel of H^s([0, 1]) and C is the covariance kernel of a Gaussian process f living on [0, 1]. With abuse of notation, we also use K and C to denote the integral operators on L^2[0, 1] generated by these kernels. Suppose further that C satisfies the Sacks-Ylvisaker conditions of order r ≥ s and that C and K may be simultaneously diagonalized. Thus there are eigenvalues λ_k and an orthonormal basis of L^2[0, 1], φ_k, so that Q = K^{−1/2} C K^{−1/2} satisfies Q φ_k = λ_k φ_k. Then we have the representations

    f = Σ_{k=1}^∞ η_k (K^{1/2} φ_k)   and   ||f||^2_{H^s} = Σ_{k=1}^∞ η_k^2.   (2.3.1)

Further, the η_k are independent, satisfy η_k ~ N(0, λ_k), and the eigenvalues decay like λ_k ≍ k^{−2(r+1)+2s}.

Proof. We follow the arguments of Yuan and Cai [46] to determine the decay of the eigenvalues of Q = K^{−1/2} C K^{−1/2}. First, we let K_p denote the reproducing kernel of H^p([0, 1]), for arbitrary p, and λ_k(O) the k-th eigenvalue of an operator O. Then the proof of Theorem 5 from [46] shows that for q > p, K_p^{−1/2} K_q K_p^{−1/2} is equivalent to K_{q−p} and λ_k(K_p^{−1/2} K_q K_p^{−1/2}) ≍ k^{−2(q−p)}. Further, the proof shows that if C satisfies the Sacks-Ylvisaker conditions of order r ≥ p, then λ_k(K_p^{−1/2} C K_p^{−1/2}) ≍ λ_k(K_p^{−1/2} K_{r+1} K_p^{−1/2}) ≍ k^{−2(r+1)+2p}, and this establishes the decay quoted in the theorem.

The representations f = Σ_{k=1}^∞ η_k (K^{1/2} φ_k) and ||f||^2_{H^s} = Σ_{k=1}^∞ η_k^2, the independence of the η_k and the distributions η_k ~ N(0, λ_k) follow immediately from the assumptions and the results of Kadota [27] on representations of Gaussian processes.

Proof of Theorem 2.3.1. We employ ideas and concepts from Kadota [27], Ritter [38] and Yuan and Cai [46] to find a constant R_n so that the norms ||f_i||_{H^s} exceed R_n only finitely often as n → ∞. With this R_n and Θ_n as in (2.2.1) we are guaranteed that f_1, ..., f_n ∈ Θ_n as n → ∞ and so the bound (2.2.3) holds over a parameter space containing f_1, ..., f_n. Let K denote the reproducing kernel for H^s([0, 1]) and assume that C and K may be simultaneously diagonalized. Thus, letting Q = K^{−1/2} C K^{−1/2}, the assumption is that Q is defined on a dense subset of L^2[0, 1] and there are eigenvalues {λ_i}_{i ∈ N} and eigenfunctions {φ_i}_{i ∈ N}, which form a basis for L^2[0, 1], so that Q φ_i = λ_i φ_i for all i ∈ N. As in [27], we may write f ~ N(0, C) as f = Σ_{k=1}^∞ η_k K^{1/2} φ_k, and in this representation the η_k are independent with η_k ~ N(0, λ_k). Further, the simultaneous diagonalization gives

    ||f||^2_{H^s([0,1])} = ⟨f, K^{−1} f⟩ = Σ_{k=1}^∞ η_k^2.

Arguments from [38] and [46] (in particular Theorem 5 of [46]) adapt to show that λ_k ≍ k^{−2(r+1)+2s} = O(k^{−2}) under the assumption that r ≥ s. Denote by Tr(Q) the sum Σ_{k=1}^∞ λ_k. Since λ_k ≍ k^{−2(r+1)+2s} = O(k^{−2}), the sum Tr(Q) = Σ_{k=1}^∞ λ_k < ∞ and we have that the ratio γ(Q) = Tr(Q^2)/(Tr(Q))^2 ≤ B is bounded. The concentration lemma of the Appendix then gives that

    P( | ||f||^2_{H^s([0,1])} − Tr(Q) | > α Tr(Q) ) ≤ exp[ −c min( α^2/B, α/B^{1/2} ) ].

A union bound then implies that for f_1, ..., f_n ~ N(0, C),

    p_n(α) := P( max_{i ≤ n} | ||f_i||^2_{H^s([0,1])} − Tr(Q) | > α Tr(Q) ) ≤ n exp[ −c min( α^2/B, α/B^{1/2} ) ].

Choosing α_n = 3B^{1/2} log n / c gives p_n(α_n) ≤ n^{−2}, which is summable. Borel-Cantelli then implies that the event

    max_{i ≤ n} | ||f_i||^2_{H^s([0,1])} − Tr(Q) | > α_n Tr(Q)

happens only finitely often. In consequence, if for some C > 1 we take R_n = (C(1 + α_n) Tr(Q))^{1/2}, then as long as n, m diverge in such a way that m^{1/2−s}(log n)^{1/2} tends to 0, we eventually have f_1, ..., f_n ∈ Θ_n as n → ∞ and m^{1/2−s} R_n → 0 (since Tr(Q) < ∞ while α_n ≍ log n). Hence for any δ_1 in the regression experiment, there is a δ_2 in the white noise experiment, and vice versa, so that

    sup_{i ≤ n, ||L|| = 1} |R_1(δ_1, L, f_i) − R_2(δ_2, L, f_i)| = o_{a.s.}(1).

Before moving on, some comments and discussion are in order. The decision rules δ_i are general (possibly randomized) rules for data of the form (1.0.2), (1.0.3), with the risks R_i(δ_i, L, f_i) conditioned on the f_i. Our interest is in MISE, but the risks covered by the theorem are arbitrary, modulo the restriction that the loss functions defining them be bounded uniformly over the parameter spaces considered. Although the theorem and corresponding proof employ some technical and unintuitive notions, these are in place for the sole purpose of allowing a simpler underlying theme to take form: combined with Gaussianity of the underlying process, they guarantee that the norms ||f_i||_{H^s} are well

behaved random variables and their maximum may be easily bounded by a slowly diverging value, R_n, which is exceeded only finitely often. Under the assumptions of the theorem, this makes it possible to choose an R_n so that m^{1/2−s} R_n tends to 0, which in turn guarantees statistical equivalence of the experiments E_{m,n} and G_{m,n} over the function classes Θ_n = F_S(s, R_n) from above. Now Θ_n eventually contains all of the f_i, and once this has happened, for any δ_2 we are guaranteed a δ_1 so that

    sup_{i ≤ n, ||L|| = 1} |R_1(δ_1, L, f_i) − R_2(δ_2, L, f_i)| ≤ Δ(E_{m,n}, G_{m,n}) ≲ m^{1/2−s} R_n,   (2.3.2)

and vice versa. Because this is tending to 0, the risks of recovering the f_i from (1.0.1) may be modelled by the risks of recovering the f_i from (1.0.7). In the context of non equi-spaced design, there is still something to say when s is large enough. To this end, we consider a third experiment F_{m,n}, generated by y_i = f(i/m) + z_i for i = 1, ..., m as f ranges over Θ_n. Thus F_{m,n} is equi-spaced and, by the result just stated, the Le Cam distance satisfies Δ(F_{m,n}, G_{m,n}) ≲ m^{1/2−s} R_n. In the particular case of a pair of multivariate normal experiments with identical variance structure, Theorem 3.1 in Brown and Low [3] and the results that follow yield the bound

    Δ(E_{m,n}, F_{m,n}) ≤ sup_{f ∈ Θ_n} H( N((f(x_i))_{i ≤ m}, I_m), N((f(i/m))_{i ≤ m}, I_m) ),

where we denote by (a_i)_{i ≤ m} the vector (a_1, ..., a_m)^T and H(·, ·) is the Hellinger distance.

In the case s > 1, it follows from standard results that

    H^2( N((f(x_i))_{i ≤ m}, I_m), N((f(i/m))_{i ≤ m}, I_m) ) ≲ Σ_{i=1}^m (f(x_i) − f(i/m))^2 ≤ ||f'||_∞^2 Σ_{i=1}^m |x_i − i/m|^2.

Hence if the design in conjunction with Θ_n satisfies sup_{f ∈ Θ_n} ||f'||_∞ (Σ_i |x_i − i/m|^2)^{1/2} → 0, then the triangle inequality gives Δ(E_{m,n}, G_{m,n}) → 0 and equivalence holds for non-equispaced designs satisfying these conditions. The design condition is satisfied if sup_{i ≤ m} |x_i − i/m| = O(m^{−(ɛ+1/2)}), in which case (Σ_i |x_i − i/m|^2)^{1/2} = O(m^{−ɛ}), while for s > 1, ||f'||_∞ may be bounded by sup_{f ∈ Θ_n} ||f||_{H^s} ≲ R_n. Thus, for the specified design condition, if the latter quantity is o(m^ɛ) as m, n → ∞, as for R_n = O(log n) when n is bounded by a polynomial power of m, the equivalence will hold.
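As a quick numerical illustration of this design-perturbation bound, the following sketch compares the squared Hellinger distance between the two Gaussian shift experiments with the quantity Σ_i (f(x_i) − f(i/m))^2 that bounds it. The jittered design, the single test function and the convention H^2 = 2(1 − Bhattacharyya coefficient) are all illustrative choices for the example, not anything specified in the thesis.

    import numpy as np

    rng = np.random.default_rng(1)

    # Equispaced points jittered at the m^{-(eps + 1/2)} scale, as in the design condition above.
    m, eps = 200, 0.25
    grid = np.arange(1, m + 1) / m
    x = grid + rng.uniform(-0.5, 0.5, size=m) * m ** (-(eps + 0.5))

    f = lambda t: np.sin(2 * np.pi * t)      # one smooth test function (illustrative)

    mu1, mu2 = f(x), f(grid)
    d2 = np.sum((mu1 - mu2) ** 2)            # the sum bounding H^2 in the display above

    # Squared Hellinger distance between N(mu1, I_m) and N(mu2, I_m),
    # using H^2 = 2 * (1 - exp(-||mu1 - mu2||^2 / 8)).
    H2 = 2.0 * (1.0 - np.exp(-d2 / 8.0))

    print(H2, d2)    # both are small when the design perturbation is small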

Chapter 3

Stein Estimation and Oracle Inequalities

Having justified studying (1.0.7) in place of (1.0.1), we move forward in this direction. The basic procedure is to fix a basis {ψ_k}_{k=1}^∞ of L^2[0, 1] and block the corresponding noisy Fourier coefficients obtained from (1.0.7) across individuals, frequency by frequency, to form the vectors

    Y_k = (Y_1(ψ_k), ..., Y_n(ψ_k))^T,   for k ∈ N.   (3.0.1)

We then estimate ξ_k = (ξ_1k, ..., ξ_nk)^T, for k = 1, ..., m, by Stein-type estimation

    ξ̂_k = α_{mn,k} Y_k,   where   α_{mn,k} = ( 1 − c_{n,m} (n/m) / ||Y_k||^2 )_+,   (3.0.2)

and construct corresponding estimates of the f_i by

    f̂_i = Σ_{k=1}^m ξ̂_ik ψ_k.   (3.0.3)

Here the constants c_{n,m} = 1 + o(1) as m, n → ∞ are chosen to improve the estimator by, with high probability, forcing α_{mn,k} = 0 for most frequencies k with

small signal to noise ratio (SNR). Properties of these estimators are explored by an oracle inequality approach. Standard oracle inequalities for Stein estimation in this context, as in e.g. Candes [11], Johnstone [26] and Tsybakov [43], bound the total risks E_{ξ_k}||ξ̂_k − ξ_k||^2. Although this works well in the classical setting, where total risk contributes to MISE, this is not the case in the current application. This is because we need the conditional MSEs of estimating ξ_ik by ξ̂_ik to calculate E_{f_i}||f̂_i − f_i||^2_{L^2} by (1.0.5), and it is unclear how to extrapolate these from classical oracle inequalities. One of the novel contributions of this thesis is to derive new oracle inequalities for Stein estimation in this setting. To this end, we employ conditional concentration of measure to derive oracle inequalities which hold simultaneously and with high probability. This is presented first in the simplified setting of bounding MSE for the components of a single ξ̂_k. Results are then lifted to bound the E_{f_i}||f̂_i − f_i||^2_{L^2}, and requirements on {ψ_k}_{k=1}^∞ in relation to the process generating the f_i are discussed. Another novel point is that we are able to recover the f_i optimally without directly estimating the covariance structure, C, as is standard practice in FDA approaches. Moving forward, we simplify notation, replacing m^{−1/2} by τ, and study the problem of estimating the ξ_k from (3.0.1) under the distributional assumptions of the introduction. Before presenting risk properties for this estimation strategy, we provide some motivation for the estimator. First consider the case where c_{n,m} = 1. The motivation behind this estimator is that for large n the distribution of ||Y_k||^2 concentrates heavily at n(λ_k + τ^2), in which case

    ξ̂_k ≈ ξ̂_k^o = λ_k/(λ_k + τ^2) Y_k.

This is the linear estimator formed from Y_k that an oracle seeking to minimize E(ξ̂_ik − ξ_ik)^2, i = 1, ..., n, would use. On average we expect this estimator to perform near optimally in terms of componentwise MSE. In fact, if we were to use this oracle

strategy, ξ̂_k^o, to estimate ξ_k, we would find conditional MSEs, R_{i,k}^o = E_{f_i}(ξ̂_ik^o − ξ_ik)^2, of

    R_{i,k}^o = λ_k τ^2/(λ_k + τ^2) + τ^4/(λ_k + τ^2)^2 (ξ_ik^2 − λ_k),   for i = 1, ..., n.   (3.0.4)

Using these oracle weights in conjunction with concentration of measure for quadratic forms we find, given mild assumptions on the covariances ⟨Cψ_j, ψ_k⟩, that this oracle would incur a risk of estimation

    R_i^o := E_{f_i}||f̂_i^o − f_i||^2_{L^2} = (1 + o_{a.s.}(1)) Σ_{k=1}^∞ λ_k τ^2/(λ_k + τ^2),   (3.0.5)

as τ → 0. If we naively take c_{n,m} = 1, this estimation strategy runs into problems. On a set of high probability we have (1 − δ)(λ_k + τ^2) ≤ ||Y_k||^2/n ≤ (1 + δ)(λ_k + τ^2) for k = 1, ..., m when δ ≍ (log m log n / n)^{1/2}, and taking c_{n,m} = 1, the best we can say is that on this event α_{mn,k} comes within an additive factor of δ of the optimal value λ_k/(λ_k + τ^2). This approach yields sub-optimal risk bounds in the low signal to noise ratio (SNR) regime. Nevertheless, when λ_k << τ^2, λ_k/(λ_k + τ^2) may be significantly closer to 0 than δ, and in this case we might do better by choosing c_{n,m} to guarantee that α_{mn,k} = 0 with high probability. In fact, by taking c_{n,m} = 1 + 2δ = 1 + o(1) we find that with high probability we are estimating the ξ_ik as 0 whenever λ_k ≤ δτ^2/(1 + δ). Given the model assumptions this condition defines a threshold β_{mn} given by

    β_{mn} ≍ ( n / (log m log n) )^{1/4(r+1)} m^{1/2(r+1)},

which is a slight inflation of the truncation point that a projection oracle would choose! For frequencies k ≳ β_{mn} we have a guarantee that with high probability we are estimating the ξ_ik as 0, and this allows us to significantly improve the risk properties of our estimator over those for c_{n,m} = 1.
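To make the estimator concrete, here is a minimal sketch of the blocked Stein procedure (3.0.2)-(3.0.3) applied to synthetic coefficients. The eigenvalue decay, the sizes and the choice δ = (12 log m / n)^{1/2} (the practical choice discussed after Theorem 4.0.3) are illustrative assumptions for this example only.

    import numpy as np

    def blocked_stein(Y, m, delta):
        # Y: (n, K) array of noisy Fourier coefficients, Y[i, k] = Y_i(psi_k).
        # m: samples per curve, so the sequence-model noise level is tau^2 = 1/m.
        # delta: concentration tolerance; the inflation factor is c_{n,m} = 1 + 2*delta.
        n, K = Y.shape
        c_nm = 1.0 + 2.0 * delta
        norms2 = np.sum(Y ** 2, axis=0)                              # ||Y_k||^2, one per frequency
        alpha = np.clip(1.0 - c_nm * (n / m) / norms2, 0.0, None)    # positive-part weights (3.0.2)
        return Y * alpha, alpha                                      # xi_hat[i, k] = alpha_k * Y[i, k]

    # Illustrative use on synthetic coefficients.
    rng = np.random.default_rng(2)
    n, m, K, r = 500, 256, 100, 1
    tau2 = 1.0 / m
    lam = np.arange(1, K + 1, dtype=float) ** (-2.0 * (r + 1))
    xi = rng.normal(size=(n, K)) * np.sqrt(lam)
    Y = xi + np.sqrt(tau2) * rng.normal(size=(n, K))

    delta = np.sqrt(12.0 * np.log(m) / n)        # lies in (0, 1/2) for these sizes
    xi_hat, alpha = blocked_stein(Y, m, delta)

    print(np.mean(np.sum((xi_hat - xi) ** 2, axis=1)),   # average blocked Stein error per curve
          np.sum(lam * tau2 / (lam + tau2)))             # oracle benchmark (3.0.5)

Frequencies with λ_k well below δτ^2 receive weight exactly 0, which is the thresholding behaviour described above.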

In relation to this we have the following theorem connecting the risk of Stein estimation to the conditional oracle risks.

Theorem 3.0.1. For ξ_ik = ⟨f_i, ψ_k⟩, λ_k = Var(⟨f_1, ψ_k⟩) and ξ_k as above, we have ξ_k ~ N_n(0, λ_k I_n) and Y_k ~ N_n(ξ_k, τ^2 I_n). For δ ∈ (0, 1/2) set A_δ = ∪_{k=1}^m A_{k,δ}, with A_{k,δ} defined by complement through the relation

    A_{k,δ}^c = { (1 − δ) n (λ_k + τ^2) ≤ ||Y_k||^2 ≤ (1 + δ) n (λ_k + τ^2) }.

Now take c_{n,m} = 1 + 2δ and let ξ̂_k denote the Stein estimates from (3.0.2). Then for all i = 1, ..., n, k = 1, ..., m it holds that

    E_{f_i}(ξ̂_ik − ξ_ik)^2 ≤ R_{ik}^o + e_ik,   (3.0.6)

where

    e_ik = max(1, ξ_ik^2/λ_k) [ C_δ min(λ_k, δτ^2) + C_τ P_i^{1/2}(A_δ) (λ_k + τ^2) ].   (3.0.7)

Here C_δ ≤ 3(6 + δ)(1 + 3δ/(1 − δ))/(1 − δ) and C_τ are both bounded constants, and P_i(·) = P(· | f_i) is the probability measure conditioned on f_i. Further, the probabilities in (3.0.7) satisfy the bounds

    P_i(A_δ) ≤ 3 exp( δ max_{k ≤ m} ξ_ik^2 / 2λ_k ) exp( −nδ^2/6 + log m ).   (3.0.8)

Before the proof, some comments are in order. Although this result has a similar feel to standard oracle inequalities for Stein estimation, the distributional assumptions of the problem at hand allow us to make quantitative assertions about the recovery of the individual effects, and this seems both genuinely new and applicable to any scenario where similar distributional assumptions may be in play. Further, although

Gaussianity simplifies the statement of the problem and corresponding calculations, the key property resulting from this assumption is sub-Gaussianity. This implies a Hanson-Wright style inequality for concentration of quadratic forms of the Fourier coefficients, which is used in the proof of the theorem and the derivation of subsequent results. The other key point is that it is possible to make the theory deal with harsher error distributions; for instance, with finite second moments one can construct sub-Gaussian estimators of the mean, and this allows us to carry the theory over to much harsher error distributions with small amendments to the estimator. Sub-Gaussianity (or something near this) will always be required of the underlying functional distribution in lifting the results to bound the error of L^2 recovery, but we do not feel that this is overly restrictive. Requiring decay of the form P(|L(f)| > t) ≤ K exp(−Ct^b), b > 0, for all linear functionals of the underlying process ensures that we are taking samples from a function class which is effectively specified by the variances, λ_k, of the Fourier coefficients. In this case the rescaled maximal Fourier coefficients for any given sample path grow like (log k)^{1/b}, keeping the size of the k-th Fourier coefficient to within logarithmic factors of λ_k^{1/2}, and so these are seen to tell us about the regularity of the sample paths. If instead we assumed decay of the form P(|L(f)| > t) ≤ K t^{−b}, b > 0, then the corresponding rescaled Fourier coefficients would be growing like k^{1/b} and we are effectively sampling from another regularity class. We might have achieved this with Fourier coefficients having variances decaying like k^{2/b} λ_k and tighter concentration for functionals of the process.

Proof of Theorem 3.0.1. We first establish the inequality in the case where λ_k > δτ^2/(1 + δ). We may write the Stein estimator of ξ_ik as

    ξ̂_ik = α_{n,k} y_ik = λ_k/(λ_k + τ^2) y_ik + ( α_{n,k} − λ_k/(λ_k + τ^2) ) y_ik.

Using that y_ik = ξ_ik + τ z_ik with z_ik ~ N(0, 1) allows us to write λ_k y_ik/(λ_k + τ^2) − ξ_ik =

30 Chapter 3. Stein Estimation and Oracle Inequalities 22 (λ k τz ik τ 2 ξ ik )/(λ k + τ 2 ) and we find ( E fi (ˆξ ik ξ ik ) 2 = R o i,k + E fi α n,k λ ) 2 k y 2 λ k + τ 2 ik }{{} I ( ) ( λk τz ik τ 2 ξ ik +2 E fi α λ k + τ 2 n,k λ ) k y λ k + τ 2 ik. }{{} II We proceed by bounding the terms I and II. Now δ (0, 1/2) and on the event A c δ the norm Y k satisfies the bounds (1 δ)n(λ k + τ 2 ) Y k 2 (1 + δ)n(λ k + τ 2 ), which gives that on A c δ α n,k λ k λ k + τ 2 3δ 1 δ τ 2 λ k + τ = C δτ 2 2 δ λ k + τ, 2 where we have fixed C δ = 3/(1 δ). Since both α n,k and λ k /(λ k + τ 2 ) lie in the interval (0, 1), this quantity is always bounded by 2. Using that E fi y 2 ik = ξ2 ik + τ 2, τ 2 /(λ k + τ 2 ) 1 gives ( E fi α n,k λ ) 2 k y 2 λ k + τ 2 ( ) ξ ik1 A c δ Cδ 2 δ 2 τ 2 2 ik + τ 2 λ k + τ 2 C 2 δ δ 2 τ 2 max ( 1, ξ2 ik λ k ). Further, we have y 4 ik 8(ξ4 ik + τ 4 z 4 ik ) which gives (E f i y 4 i ) 1/2 (8(ξ 4 i + 3τ 4 )) 1/2 24(ξ 2 ik +τ 2 ) while writing ξ 2 ik +τ 2 = (ξ 2 ik /λ k)λ k +τ 2 gives that ξ 2 ik +τ 2 max(1, ξ 2 ik /λ k)(λ k + τ 2 ). In the range under consideration, δτ 2 /(1 + δ) = min(λ k, δτ 2 /(1 + δ)) implies δτ 2 (1 + δ) min(λ k, δτ 2 ) and so an application of lemma yields ( ) ( I max 1, ξ2 ik Cδ 2 δ(1 + δ) min(λ k, δτ 2 ) + ) 24P 1/2 i (A δ )(λ k + τ 2 ). λ k It remains to bound the final term in the expression for E fi (ˆξ ik ξ ik ) 2. We

31 Chapter 3. Stein Estimation and Oracle Inequalities 23 begin by writing (λ k τz ik τ 2 ξ ik )y ik = (λ k τz ik τ 2 ξ ik )(ξ ik + τz ik ) and expand to get (λ k τz ik τ 2 ξ ik )y ik = λ k τ 2 z 2 ik τ 2 ξ 2 ik + (λ kτ τ 3 )z ik ξ ik. The final term to bound may be now be written ( II = E fi α n,k λ k λ k + τ 2 ) ( λk τ 2 z 2 ik τ 2 ξ 2 ik + (λ kτ τ 3 )z ik ξ ik λ k + τ 2 We pass the expectation through and bound this quantity term by term. For the first term, using that E fi z 2 i 1 Aδ E fi z 2 i = 1, we get a bound of ( E fi α n,k λ ) k λk τ 2 zik 2 λ k + τ 2 λ k + τ 1 δλ k τ 4 2 A δ C δ (λ k + τ 2 ) C δλ k τ 2 2 δ λ k + τ. 2 Similarly, for the second term we find that τ ( 2 ξik 2 λ k + τ E 2 f i α n,k λ ) k δξik 2 1 λ k + τ 2 Aδ C τ 4 δ (λ k + τ 2 ) C 2 δ Finally, we may write E fi ( α n,k λ ) ( k z λ k + τ 2 ik 1 Aδ = E fi α n,k ( ξ 2 ik λ k ). ) δλk τ 2 λ k + τ 2. λ ) k z λ k + τ 2 ik (1 zik <0 + 1 zik 0)1 Aδ and use that E fi z ik 1 zik 01 Aδ and E fi z ik 1 zik <01 Aδ are both bounded by E fi z ik 1 zik 0 = (2π) 1/2 while 2(2π) 1/2 1 to arrive at a bound of ( E fi α n,k λ ) k δτ 2 z λ k + τ 2 ik 1 Aδ C δ λ k + τ. 2 Arguing in the same way, we find a lower bound of C δ δτ 2 /(λ k + τ 2 ). Now if a, b are arbitrary numbers with b B and c and d are positive numbers, then a b (c d) = a b (max(c, d) min(c, d)) a B max(c, d). Using this, we find that ( (λ k τ τ 3 )ξ ik E λ k + τ 2 fi α n,k λ ) k δ ξ ik τ 3 max(λ k, τ 2 ) z λ k + τ 2 ik 1 Aδ C δ. (λ k + τ 2 ) 2

32 Chapter 3. Stein Estimation and Oracle Inequalities 24 Now observe that for any α (0, 2), since ab (a 2 + b 2 )/2, we have that 2δ 1 α/2 τ δα/2 ξ ik τ 2 λ k + τ 2 δ 2 α τ 2 + δ α ξ 2 ik τ 4 ( ξ (λ k + τ 2 ) 2 δ2 α τ 2 + δ α 2 ik λ k ) λk τ 2 λ k + τ 2. We observe that (λ k τz ik τ 2 ξ ik ) 2 2(λ 2 k τ 2 z 2 ik + τ 4 ξ 2 ik ) and that y2 ik 2(τ 2 z 2 ik + ξ2 ik ). Then expanding (λ k τz ik τ 2 ξ ik ) 2 y 2 ik and noting that E f i z 4 ik = 3 we arrive at the bound E fi (λ k τz ik τ 2 ξ ik ) 2 y 2 ik 12(λ2 k τ 2 +τ 4 ξ 2 ik )(τ 2 +ξ 2 ik ) 12λ kτ 2 (max(1, ξ 2 ik /λ k)(λ k +τ 2 )) 2, which gives the bound ( ) λk τz ii τ 2 2 ξ ik E fi y 2 λ k + τ 2 ik 12(max(1, ξik/λ 2 k )(λ k + τ 2 )) 2. An application of lemma gives ( ) [ 2II max 1, ξ2 ik C δ (4δ + δ α ) λ kτ 2 λ k λ k + τ 2 +C δ δ 2 α τ P 1/2 i (A δ )(λ k + τ 2 ) ] For δ (0, 1/2), λ k δτ 2 /(λ k + τ 2 ) λ k δτ 2 /(λ k + δτ 2 ) min(λ k, δτ 2 ), while in the range under consideration, δτ 2 (1 + δ) min(λ k, δτ 2 ). Taking α = 1, this reduces to ( ) [ 2II max 1, ξ2 ik 6C δ (1 + δ) min(λ k, δτ 2 ) + 4 ] 12P 1/2 i (A δ )(λ k + τ 2 ). λ k Combining bounds for terms I and II gives the bound ( ) [ E fi (ˆξ ik ξ ik ) 2 R o i,k + max 1, ξ2 ik C δ min(λ k, δτ 2 ) λ k +C τ P 1/2 i (A δ )(λ k + τ 2 ) ],

33 Chapter 3. Stein Estimation and Oracle Inequalities 25 where C δ = (6 + δ + δ(1 + δ)c δ) C δ and C τ = ( ) and this provides the bound of the theorem in the case that λ k > δτ 2. In the case that λ k δτ 2 /(1 + δ), min(λ k, δτ 2 /(1 + δ)) = λ k and we have that α n,k = 0 on the event A δ which gives that E fi (ˆξ ik ξ ik ) 2 1 Aδ ξik 2. We also have E fi (ˆξ ik ξ ik ) 4 4E fi (ξ 2 ik +τ2 z 2 ik )2 24(ξ 4 ik +τ4 ) and so an application of lemma gives E fi (ˆξ ik ξ ik ) 2 ξik P 1/2 i (A δ )(ξik 2 + τ 2 ) ( ) [ max 1, ξ2 ik min(λ k, δτ 2 ) + ] 24P 1/2 i (A δ )(λ k + τ 2 ) λ k and this implies the bound of the theorem in the second range. Finally, noticing that lemma gives P i (A k,δ ) 3 exp(δξ 2 i,k /2λ k) exp( nδ 2 /6) while δξ 2 i,k /2λ k max k m ξ 2 i,k /2λ k, a union bound concludes the proof.
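Before moving on to Chapter 4, a small Monte Carlo sketch of the concentration event A_δ underlying the bound (3.0.8) may be helpful: it estimates how often ||Y_k||^2 leaves the band (1 ± δ) n (λ_k + τ^2) for some k ≤ m. The settings are illustrative only and are not those used in the thesis.

    import numpy as np

    rng = np.random.default_rng(3)
    n, m, K, r, reps = 500, 256, 100, 1, 200
    tau2 = 1.0 / m
    lam = np.arange(1, K + 1, dtype=float) ** (-2.0 * (r + 1))
    delta = np.sqrt(12.0 * np.log(m) / n)

    hits = 0
    for _ in range(reps):
        xi = rng.normal(size=(n, K)) * np.sqrt(lam)
        Y = xi + np.sqrt(tau2) * rng.normal(size=(n, K))
        norms2 = np.sum(Y ** 2, axis=0)
        lo = (1.0 - delta) * n * (lam + tau2)
        hi = (1.0 + delta) * n * (lam + tau2)
        hits += np.any((norms2 < lo) | (norms2 > hi))   # did A_delta occur on this draw?

    print(hits / reps)   # empirical frequency of A_delta; it should be very small for these sizes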

Chapter 4

Minimax Recovery of Functional Data

We now lift the results of Theorem 3.0.1 to our estimation problem and draw a connection to the classical theory. This is done by considering the smallest collection of parameter spaces F_{m,n} which contains the f_1, ..., f_n almost surely as m, n → ∞. For any f ∈ F_{m,n} we may consider the white noise experiment

    Y(dt) = f(t) dt + m^{−1/2} B(dt).

In relation to this parameter space, we show that the estimator is nearly minimax rate adaptive over F_{m,n} and performs below the minimax rate for any function drawn from a similar distribution to that generating the f_i. This gives our procedure the flavour of a minimax robust strategy which may have applications beyond recovery of the f_i. The procedure is to estimate ξ_k by (3.0.2) for k ≤ m and by 0 for k > m. Then with

f̂_i as in (3.0.3), isometry gives

    E_{f_i}||f̂_i − f_i||^2_{L^2} = Σ_{k=1}^m E_{f_i}(ξ̂_{i,k} − ξ_{i,k})^2 + Σ_{k>m} ξ_{i,k}^2,

which we may bound using the oracle inequality from Theorem 3.0.1 and concentration of measure. Theorem 3.0.1 bounds the risks E_{f_i}(ξ̂_{i,k} − ξ_{i,k})^2 by R_{i,k}^o + e_{i,k} for i = 1, ..., n and k = 1, ..., m. With R_i = Σ_{k=1}^m R_{i,k}^o and e_i = Σ_{k=1}^m e_{i,k}, applying the theorem gives that

    E_{f_i}||f̂_i − f_i||^2_{L^2} ≤ R_i + e_i + Σ_{k>m} ξ_{i,k}^2,   i = 1, ..., n.   (4.0.1)

We shall examine and bound the terms on the right hand side separately, but first outline and discuss the assumptions required for this task. For the bound on the R_i we begin by noticing that these terms may be split into random and deterministic components as

    R_i = Σ_{k=1}^m λ_k τ^2/(λ_k + τ^2) + ζ_{i,m},   where   ζ_{i,m} = Σ_{k=1}^m τ^4/(λ_k + τ^2)^2 (ξ_{ik}^2 − λ_k).

Although the terms τ^4(ξ_{ik}^2 − λ_k)/(λ_k + τ^2)^2 may force R_{i,k}^o to be quite different from the optimal risk λ_k τ^2/(λ_k + τ^2) from one value of k to another, cancellations ensure that the ζ_{i,m} concentrate heavily at 0 under relatively mild correlation assumptions. This allows us to place this first part near the optimal risk with high probability. In fact, noting that C = E f_1 ⊗ f_1 gives Cov(ξ_{i,j}, ξ_{i,k}) = ⟨ψ_j, Cψ_k⟩, and so the variables x_{i,k} = τ^2 ξ_{i,k}/(λ_k + τ^2) are mean 0 Gaussian with covariance Q_ξ given termwise by

    Q_{ξ,jk} = ⟨ψ_j, Cψ_k⟩ τ^4 / ( (λ_j + τ^2)(λ_k + τ^2) ).   (4.0.2)

Following the notation of Lemma 7.0.4, we have ζ_{i,m} = ||x_i||^2 − Tr(Q_ξ), and so with

γ(Q_ξ) = Tr(Q_ξ^2)/(Tr(Q_ξ))^2 we may apply the concentration result of the lemma. Strong concentration relies on the ratio γ(Q_ξ) being small which, in turn, requires assumptions on the correlation between Fourier coefficients in the basis used.

Assumption 4.0.1. We make the following assumptions regarding the relation of {ψ_k}_{k=1}^∞ to C.

i.) The variances of the Fourier coefficients decay at the Karhunen-Loève rate, so that, given the Sacks-Ylvisaker assumption of order r, λ_k = ⟨ψ_k, Cψ_k⟩ ≍ k^{−2(r+1)}.

ii.) The correlations between the Fourier coefficients decay at a reasonable rate with distance between indices, in that

    |⟨ψ_i, Cψ_j⟩| ≲ (ij)^{−(r+1)} / (1 + |i − j|)^{β/2},   (4.0.3)

for some β > 1.

We make a couple of points regarding these assumptions. For the first assumption, covariance functions which satisfy the Sacks-Ylvisaker conditions of order r generate RKHSs which lie within a polynomial translation of the Sobolev space H^{r+1}([0, 1]) and the sample paths of the process just about lie in these (they lie in H^ν([0, 1]) for ν < r + 1/2 and fill this out as we repeatedly sample). There are many comparable smoothness classes which share similar decay of Fourier coefficients when expressed in bases which efficiently represent them and it is reasonable to expect that in such bases the variances of the Fourier coefficients decay at the Karhunen-Loève rate. For the second assumption, Cauchy-Schwarz bounds the covariances between Fourier coefficients by |⟨ψ_i, Cψ_j⟩| ≤ (λ_i λ_j)^{1/2} ≍ (ij)^{−(r+1)}. This assumption is telling us that when all variances are blown up to the same scale, the correlations between coefficients are bounded by a stationary covariance structure with weak decay. This

is also quite reasonable, especially in the case where {ψ_k}_{k=1}^∞ are taken to be wavelets. At the right scale, such off-diagonal decay conditions hold for broad classes of differential operators in wavelet bases [14], in which case the same off-diagonal decay also holds for the inverses of these operators, and these naturally correspond to covariance operators [38]. With this assumption, we are able to prove the following bounds for γ(Q_ξ):

Lemma 4.0.2. Suppose that C satisfies the Sacks-Ylvisaker conditions of order r and, in {ψ_k}_{k=1}^∞, λ_k = ⟨ψ_k, Cψ_k⟩ ≍ k^{−2(r+1)} and the ⟨ψ_i, Cψ_j⟩ satisfy the decay conditions (4.0.3). Suppose further that τ → 0 and m → ∞ in such a way that m^{−1} = o(τ^{1/(r+1)}). Then it follows that γ(Q_ξ) ≲ τ^{1/(r+1)}.

This allows us to establish the following theorem for the purpose of bounding (4.0.1).

Theorem 4.0.3. Suppose that τ^2 → 0 and n, m → ∞ so that τ^2 ≍ m^{−1} and m^ς ≤ n ≤ m^υ for ς, υ > 0. Then for δ = (12κ log n log m / n)^{1/2} and with the assumptions listed above on the correlations between the ξ_k, it follows that on a set of probability at least 1 − O(n^{−2}) the terms bounding the E_{f_i}||f̂_i − f_i||^2_{L^2} satisfy the following inequalities:

i.) max_{i ≤ n} R_i ≤ (1 + o(1)) Σ_{k=1}^∞ λ_k τ^2/(λ_k + τ^2) ≍ m^{−(2r+1)/(2r+2)}.

ii.) max_{i ≤ n} P_i^{1/2}(A_δ) ≤ exp(o(1)) m^{−κ log n + 1/2}.

iii.) max_{i ≤ n} e_i ≤ 6 log(nm) [ C_δ Σ_{k=1}^∞ min(λ_k, δτ^2) + C_τ (1 + E||f||_2^2) max_{i ≤ n} P_i^{1/2}(A_δ) ] ≲ log(mn) [ δ^{(2r+1)/(2r+2)} m^{−(2r+1)/(2r+2)} + m^{−κ log n + 1/2} ].

iv.) max_{i ≤ n} Σ_{k>m} ξ_{i,k}^2 = (1 + o(1)) Σ_{k>m} λ_k ≲ m^{−(2r+1)}.

Given the assumptions, δ^{(2r+1)/(2r+2)} log(nm) = o(1) and also m^{−κ log n + 1/2} log(mn) = o(m^{−(2r+1)/(2r+2)}). Thus as m, n → ∞,

    max_{i ≤ n} E_{f_i}||f̂_i − f_i||_2^2 = (1 + o_{a.s.}(1)) Σ_{k=1}^∞ λ_k τ^2/(λ_k + τ^2) ≍ m^{−(2r+1)/(2r+2)},

which is the oracle rate.

Before the proof, some remarks. This rate is known to be optimal for L^2 reconstruction of f, in the sense of averaged error, when one has m noisy samples of f at arbitrary points in [0, 1]. See e.g. [38], Chapter 5, Proposition 3. Whereas the optimal estimator in [38] relies on knowing C, and thus on an oracle's knowledge of the underlying model, we are able to achieve this rate from the data alone. Further, conditionally on the observed trajectories, our rates are shown to hold with high probability, which provides sample to sample guarantees that averaged bounds do not. It is not totally transparent from iii.) of Theorem 4.0.3, but there is a tradeoff between the terms δ^{(2r+1)/(2r+2)} m^{−(2r+1)/(2r+2)} and m^{−κ log n + 1/2}. Larger δ puts more weight on the factor m^{−(2r+1)/(2r+2)} and shrinks m^{−κ log n + 1/2} by placing a larger κ in

39 Chapter 4. Minimax Recovery of Functional Data 31 the exponent. Ideally we would like to balance the two terms which happens when we choose δ as small as possible. Taking κ = 3/(2 log n) gives δ = (18 log m/n) 1/2 and m κ log n+1/2 = m 1 = o(m (2r+1)/(2r+2) ), which is about as small as we can take δ without knowing anything about r. In fact with slightly worse constants of proportionality (which still have the order 1±δ), the probabilities used in our proofs concentrate like exp( nδ 2 /4) which means the m κ log n+1/2 3κ log n/2+1/2 should look like m and this justifies taking δ = (12 log m/n) 1/2 for practical implementation. Proof of theorem Let the event D α be defined as D α = { } max ζ i,m > α T r(q ξ ). i n Then choosing α so that cα = 3τ ϕ/2 log n, lemma gives P (D α ) 2n 2. Now λ k τ 4 /(λ k + τ 2 ) 2 λ k τ 2 /(λ k + τ 2 ) gives that T r(q ζ ) m k=1 λ kτ 2 /(λ k + τ 2 ) and so on D c α, we have max i n R i (1 + α) n k=1 λ k τ 2 λ k + τ 2 which satisfies the desired bound as long as α = o(1) as m. In the case that τ 2 m 1 this happens as long as n is bounded by a power of m, which holds by assumption. For γ > 2, on an event of probability 1 (nm) 1 γ2 /2, it holds that max i n,k m λ 1/2 k ξ ik γ(log(nm)) 1/2. In this case we find that the P i (A k,δ ) satisfy P i (A k,δ ) 8 exp(δγ 2 log(nm)/2) exp( nδ 2 /6). By assumption δ log(nm) n 1/2 log n = o(1) and so taking δ = (12 log n/n) 1/2 gives

the quoted bound for the P_i^{1/2}(A_{k,δ}). Also, on this event, the bound for the e_{i,k} from Theorem 3.0.1 reduces to

    e_ik = γ^2 log(nm) [ C_δ min(λ_k, δτ^2) + C P_i^{1/2}(A_δ)(λ_k + τ^2) ].

This sums to yield the bound of (iii). Finally, we take E_α to be the event

    E_α = { max_{i ≤ n} |η_{i,m}| > α Tr(Q_η) },

and notice that, by Lemma 7.0.4, choosing α so that cα = 3m^{−1/2} log n gives P(E_α) ≤ 2n^{−2}. By assumption α ≲ n^{−1/(2γ^2)} log n = o(1) and so on E_α^c the third bound of the theorem holds. A union bound gives that the bounds of the theorem hold with probability at least 1 − 4n^{−2} − (nm)^{1−γ^2/2}, which matches the quoted probability for γ^2 ≥ 6. We now shift focus to constructing a collection of parameter spaces over which this estimation procedure has a minimax interpretation. With a basis {ψ_k}_{k=1}^∞ satisfying the covariance conditions (4.0.3) and m, n as in Theorem 4.0.3, we define a collection of parameter spaces, F_{m,n}, by

    F_{m,n} = { f ∈ L^2 : |⟨f, ψ_k⟩| ≤ a(λ_k log(mn))^{1/2} for k ≤ m and |⟨f, ψ_k⟩| ≤ b(λ_k log(nk))^{1/2} for k > m },   (4.0.4)

with a, b to be determined. The idea is to find the smallest a, b guaranteeing that f_1, ..., f_n ∈ F_{m,n} as m, n → ∞, thus keeping the collection as small as possible.
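As a small illustration of how the constant a in (4.0.4) behaves, the following sketch (illustrative settings, with the coefficients drawn independently) simulates Gaussian coefficients with variances λ_k and records the smallest a for which the constraint |⟨f, ψ_k⟩| ≤ a(λ_k log(mn))^{1/2} holds for all i ≤ n and k ≤ m. Since the maximum of nm standard normals grows like (2 log(nm))^{1/2}, this value should remain bounded, roughly near √2.

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, r = 500, 256, 1
    lam = np.arange(1, m + 1, dtype=float) ** (-2.0 * (r + 1))
    xi = rng.normal(size=(n, m)) * np.sqrt(lam)          # xi_{ik} ~ N(0, lambda_k)

    a_min = np.max(np.abs(xi) / np.sqrt(lam * np.log(m * n)))
    print(a_min)    # smallest a placing all n simulated coefficient sequences inside (4.0.4) for k <= m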


More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Learning gradients: prescriptive models

Learning gradients: prescriptive models Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan

More information

Optimal Estimation of a Nonsmooth Functional

Optimal Estimation of a Nonsmooth Functional Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose

More information

D I S C U S S I O N P A P E R

D I S C U S S I O N P A P E R I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S ( I S B A ) UNIVERSITÉ CATHOLIQUE DE LOUVAIN D I S C U S S I O N P A P E R 2014/06 Adaptive

More information

Asymptotically sufficient statistics in nonparametric regression experiments with correlated noise

Asymptotically sufficient statistics in nonparametric regression experiments with correlated noise Vol. 0 0000 1 0 Asymptotically sufficient statistics in nonparametric regression experiments with correlated noise Andrew V Carter University of California, Santa Barbara Santa Barbara, CA 93106-3110 e-mail:

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

Wavelet Shrinkage for Nonequispaced Samples

Wavelet Shrinkage for Nonequispaced Samples University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Wavelet Shrinkage for Nonequispaced Samples T. Tony Cai University of Pennsylvania Lawrence D. Brown University

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Bayesian Nonparametric Point Estimation Under a Conjugate Prior

Bayesian Nonparametric Point Estimation Under a Conjugate Prior University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 5-15-2002 Bayesian Nonparametric Point Estimation Under a Conjugate Prior Xuefeng Li University of Pennsylvania Linda

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Gaussian Processes. 1. Basic Notions

Gaussian Processes. 1. Basic Notions Gaussian Processes 1. Basic Notions Let T be a set, and X : {X } T a stochastic process, defined on a suitable probability space (Ω P), that is indexed by T. Definition 1.1. We say that X is a Gaussian

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad 40.850: athematical Foundation of Big Data Analysis Spring 206 Lecture 8: inimax Lower Bounds: LeCam, Fano, and Assouad Lecturer: Fang Han arch 07 Disclaimer: These notes have not been subjected to the

More information

Nonparametric Inference In Functional Data

Nonparametric Inference In Functional Data Nonparametric Inference In Functional Data Zuofeng Shang Purdue University Joint work with Guang Cheng from Purdue Univ. An Example Consider the functional linear model: Y = α + where 1 0 X(t)β(t)dt +

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

Stochastic Spectral Approaches to Bayesian Inference

Stochastic Spectral Approaches to Bayesian Inference Stochastic Spectral Approaches to Bayesian Inference Prof. Nathan L. Gibson Department of Mathematics Applied Mathematics and Computation Seminar March 4, 2011 Prof. Gibson (OSU) Spectral Approaches to

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

1.5 Approximate Identities

1.5 Approximate Identities 38 1 The Fourier Transform on L 1 (R) which are dense subspaces of L p (R). On these domains, P : D P L p (R) and M : D M L p (R). Show, however, that P and M are unbounded even when restricted to these

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University WHOA-PSI Workshop, St Louis, 2017 Quotes from Day 1 and Day 2 Good model or pure model? Occam s razor We really

More information

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich Submitted to the Annals of Applied Statistics arxiv: math.pr/0000000 THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES By Sara van de Geer and Johannes Lederer ETH Zürich We study high-dimensional

More information

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior 3. Some tools for the analysis of sequential strategies based on a Gaussian process prior E. Vazquez Computer experiments June 21-22, 2010, Paris 21 / 34 Function approximation with a Gaussian prior Aim:

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Convex Optimization Notes

Convex Optimization Notes Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Asymptotic Equivalence and Adaptive Estimation for Robust Nonparametric Regression

Asymptotic Equivalence and Adaptive Estimation for Robust Nonparametric Regression Asymptotic Equivalence and Adaptive Estimation for Robust Nonparametric Regression T. Tony Cai 1 and Harrison H. Zhou 2 University of Pennsylvania and Yale University Abstract Asymptotic equivalence theory

More information

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1 Lectures - Week 11 General First Order ODEs & Numerical Methods for IVPs In general, nonlinear problems are much more difficult to solve than linear ones. Unfortunately many phenomena exhibit nonlinear

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

A Statistical Analysis of Fukunaga Koontz Transform

A Statistical Analysis of Fukunaga Koontz Transform 1 A Statistical Analysis of Fukunaga Koontz Transform Xiaoming Huo Dr. Xiaoming Huo is an assistant professor at the School of Industrial and System Engineering of the Georgia Institute of Technology,

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

New Statistical Applications for Differential Privacy

New Statistical Applications for Differential Privacy New Statistical Applications for Differential Privacy Rob Hall 11/5/2012 Committee: Stephen Fienberg, Larry Wasserman, Alessandro Rinaldo, Adam Smith. rjhall@cs.cmu.edu http://www.cs.cmu.edu/~rjhall 1

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Asymptotic efficiency of simple decisions for the compound decision problem

Asymptotic efficiency of simple decisions for the compound decision problem Asymptotic efficiency of simple decisions for the compound decision problem Eitan Greenshtein and Ya acov Ritov Department of Statistical Sciences Duke University Durham, NC 27708-0251, USA e-mail: eitan.greenshtein@gmail.com

More information

Karhunen-Loève decomposition of Gaussian measures on Banach spaces

Karhunen-Loève decomposition of Gaussian measures on Banach spaces Karhunen-Loève decomposition of Gaussian measures on Banach spaces Jean-Charles Croix jean-charles.croix@emse.fr Génie Mathématique et Industriel (GMI) First workshop on Gaussian processes at Saint-Etienne

More information

Functional Latent Feature Models. With Single-Index Interaction

Functional Latent Feature Models. With Single-Index Interaction Generalized With Single-Index Interaction Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University Naisyin Wang and

More information

L p -boundedness of the Hilbert transform

L p -boundedness of the Hilbert transform L p -boundedness of the Hilbert transform Kunal Narayan Chaudhury Abstract The Hilbert transform is essentially the only singular operator in one dimension. This undoubtedly makes it one of the the most

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Some functional (Hölderian) limit theorems and their applications (II)

Some functional (Hölderian) limit theorems and their applications (II) Some functional (Hölderian) limit theorems and their applications (II) Alfredas Račkauskas Vilnius University Outils Statistiques et Probabilistes pour la Finance Université de Rouen June 1 5, Rouen (Rouen

More information

Sparse Legendre expansions via l 1 minimization

Sparse Legendre expansions via l 1 minimization Sparse Legendre expansions via l 1 minimization Rachel Ward, Courant Institute, NYU Joint work with Holger Rauhut, Hausdorff Center for Mathematics, Bonn, Germany. June 8, 2010 Outline Sparse recovery

More information

Minimax Estimation of Kernel Mean Embeddings

Minimax Estimation of Kernel Mean Embeddings Minimax Estimation of Kernel Mean Embeddings Bharath K. Sriperumbudur Department of Statistics Pennsylvania State University Gatsby Computational Neuroscience Unit May 4, 2016 Collaborators Dr. Ilya Tolstikhin

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Modeling with Itô Stochastic Differential Equations

Modeling with Itô Stochastic Differential Equations Modeling with Itô Stochastic Differential Equations 2.4-2.6 E. Allen presentation by T. Perälä 27.0.2009 Postgraduate seminar on applied mathematics 2009 Outline Hilbert Space of Stochastic Processes (

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

A Lower Bound Theorem. Lin Hu.

A Lower Bound Theorem. Lin Hu. American J. of Mathematics and Sciences Vol. 3, No -1,(January 014) Copyright Mind Reader Publications ISSN No: 50-310 A Lower Bound Theorem Department of Applied Mathematics, Beijing University of Technology,

More information

Universal examples. Chapter The Bernoulli process

Universal examples. Chapter The Bernoulli process Chapter 1 Universal examples 1.1 The Bernoulli process First description: Bernoulli random variables Y i for i = 1, 2, 3,... independent with P [Y i = 1] = p and P [Y i = ] = 1 p. Second description: Binomial

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information Introduction Consider a linear system y = Φx where Φ can be taken as an m n matrix acting on Euclidean space or more generally, a linear operator on a Hilbert space. We call the vector x a signal or input,

More information

Kernels A Machine Learning Overview

Kernels A Machine Learning Overview Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Inverse Statistical Learning

Inverse Statistical Learning Inverse Statistical Learning Minimax theory, adaptation and algorithm avec (par ordre d apparition) C. Marteau, M. Chichignoud, C. Brunet and S. Souchet Dijon, le 15 janvier 2014 Inverse Statistical Learning

More information

Decoupling Lecture 1

Decoupling Lecture 1 18.118 Decoupling Lecture 1 Instructor: Larry Guth Trans. : Sarah Tammen September 7, 2017 Decoupling theory is a branch of Fourier analysis that is recent in origin and that has many applications to problems

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Lecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differenti. equations.

Lecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differenti. equations. Lecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differential equations. 1 Metric spaces 2 Completeness and completion. 3 The contraction

More information

Distance between multinomial and multivariate normal models

Distance between multinomial and multivariate normal models Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating

More information

Submitted to the Brazilian Journal of Probability and Statistics

Submitted to the Brazilian Journal of Probability and Statistics Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a

More information

Brownian Motion. Chapter Stochastic Process

Brownian Motion. Chapter Stochastic Process Chapter 1 Brownian Motion 1.1 Stochastic Process A stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ,P and a real valued stochastic

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information