CONVERGENCE RATES OF GENERAL REGULARIZATION METHODS FOR STATISTICAL INVERSE PROBLEMS AND APPLICATIONS


BY N. BISSANTZ, T. HOHAGE, A. MUNK AND F. RUYMGAART

UNIVERSITY OF GÖTTINGEN AND TEXAS TECH UNIVERSITY

Abstract. In the past, the convergence analysis for linear statistical inverse problems has mainly focused on spectral cut-off and Tikhonov-type estimators. Spectral cut-off estimators achieve minimax rates for a broad range of smoothness classes and operators, but their practical usefulness is limited by the fact that they require a complete spectral decomposition of the operator. Tikhonov estimators are simpler to compute, but still involve the inversion of an operator and achieve minimax rates only in very restricted smoothness classes. In this paper we introduce a unifying technique to study the mean square error of a large class of regularization methods (spectral methods) including the aforementioned estimators as well as many iterative methods, such as ν-methods and the Landweber iteration. The latter estimators converge at the same rate as spectral cut-off, but only require matrix-vector products. Our results are applied to various problems; in particular we obtain precise convergence rates for satellite gradiometry, L2-boosting, and errors in variable problems.

AMS subject classifications. 62G05, 62J05, 62P35, 65J10, 35R30

Key words. Statistical inverse problems, iterative regularization methods, Tikhonov regularization, nonparametric regression, minimax convergence rates, satellite gradiometry, Hilbert scales, boosting, errors in variable.

1. Introduction. This paper is concerned with estimating an element f of a Hilbert space H_1 from indirect noisy measurements

    Y = Kf + noise                                                    (1.1)

related to f by a (known) operator K : H_1 → H_2 mapping H_1 to another Hilbert space H_2. The operator K is assumed to be linear, bounded, and injective, but not necessarily compact. We are interested in the case that the operator equation (1.1)
is ill-posed in the sense that the Moore-Penrose inverse of K is unbounded. The analysis of regularization methods for the stable solution of (1.1) depends on the mathematical model for the noise term on the right-hand side of (1.1): If the noise is considered as a deterministic quantity, it is natural to study the worst case error. In the literature a number of efficient methods for the solution of (1.1) have been developed, and it has been shown under certain conditions that the worst case error converges of optimal order as the noise level tends to 0 (see Engl et al. [3]). If the noise is modeled as a random quantity, the convergence of estimators f̂ of f should be studied in statistical terms, e.g. the expected square error E‖f̂ − f‖², also called mean integrated square error (MISE). This problem has also been studied extensively in the statistical literature, but the numerical efficiency has not been a major issue so far. It is the purpose of this paper to provide an analysis of a class of computationally efficient regularization methods including Landweber iteration, ν-methods, and iterated Tikhonov regularization, which is applicable to linear inverse problems with random noise as they occur, for example, in parameter identification problems in partial differential equations, deconvolution or errors in variable models. There exists a considerable amount of literature on regularization methods for linear inverse problems with random noise. For surveys we refer to O'Sullivan [35], Nychka & Cox [34], Evans & Stark [4] and Kaipio & Somersalo [24]. A large part of the literature focuses on methods which require the explicit knowledge of a spectral decomposition of the operator K*K. The simplest of these methods is spectral cut-off (or truncated singular value decomposition for compact operators), where an estimator is constructed by a truncated expansion of f with respect to the eigenfunctions of K*K (e.g. Diggle & Hall [9], Healy, Hendriks & Kim [9]).
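As a concrete illustration of the spectral cut-off idea, here is a minimal numerical sketch (ours, not from the paper): the operator is discretized as a matrix, so its SVD plays the role of the spectral decomposition of K*K. All names and constants below are illustrative assumptions.

```python
import numpy as np

def spectral_cutoff(K, Y, alpha):
    """Spectral cut-off (truncated SVD) estimator.

    Keeps the singular components of K whose squared singular values
    exceed the cut-off level alpha; needs the full SVD of K, which is
    exactly the expensive ingredient discussed in the text.
    """
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    keep = s ** 2 >= alpha
    return Vt[keep].T @ ((U[:, keep].T @ Y) / s[keep])

# Toy mildly ill-posed problem: a discretized integration operator,
# (Kf)(x) ~ integral_0^x f(t) dt, perturbed by white observation noise.
n = 200
K = np.tril(np.ones((n, n))) / n
x = np.linspace(0.0, 1.0, n)
f_true = np.sin(2 * np.pi * x)
rng = np.random.default_rng(0)
Y = K @ f_true + 1e-3 * rng.standard_normal(n)
f_hat = spectral_cutoff(K, Y, alpha=1e-4)
```

Computing the full SVD of the discretized operator is precisely the cost that iterative methods, which the paper analyzes later, avoid.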
It has been shown in a number of papers that spectral cut-off estimators are order optimal in a minimax sense under certain conditions (e.g. Mair & Ruymgaart [28], Efromovich [2], Kim & Koo [25]). Based on a singular value decomposition (SVD) of K it is also possible to construct exact minimax estimators for given smoothness classes (see Johnstone & Silverman [23]). Another major approach are wavelet-vaguelette (and vaguelette-wavelet) based methods, which lead to estimators of a similar functional form as SVD methods. However, in general these estimators are based on expansions of f and Kf with respect to different bases of the respective function spaces than those provided by the SVD of K (e.g. Donoho [], Abramovich & Silverman [], Johnstone et al. [22]). A well-known method both in the statistical and the deterministic inverse problems literature is Tikhonov regularization. This has been studied for certain classes of linear statistical inverse problems by Cox [8], Nychka & Cox [34] and Mathé & Pereverzev [29, 3]. The main restriction of the usefulness of spectral cut-off and related estimators is the need of the spectral data of the operator (i.e. an SVD if K is compact) to implement these estimators. This is known explicitly only in a limited number of special cases, and numerical computation of the spectral data is prohibitively expensive for many real-world applications. Although Tikhonov regularization does not require the spectral data of the operator, there is still a need of setting up and inverting a matrix representing the operator. For iterative regularization methods such as Landweber iteration or ν-methods (see Nemirovskii & Polyak [33], Brakhage [5], and Engl et al. [3]) only matrix-vector multiplications are required. Therefore, iterative regularization methods are the methods of choice for many inverse problems in partial differential equations (e.g. the backwards heat equation discussed in section 5). Here K is a so-called parameter-to-solution operator, i.e. an application of this operator to a vector f simply means solving a differential equation with a parameter f. On the other hand, setting up the matrix for K is often not feasible. Furthermore, it is known that Tikhonov regularization achieves minimax rates of convergence only in a very restricted number of smoothness classes, which is highlighted by the fact that its qualification number is 1, whereas Landweber iteration has infinitely large qualification, and ν-methods with qualification ν are available for every ν > 0 (see [3]).
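A minimal sketch of the Landweber iteration on a toy discretized problem (our illustration; the step length and iteration count are assumed values, not the paper's). The point is that only products with K and its transpose are needed, never an SVD or a matrix inversion; the iteration count plays the role of the regularization parameter (α ≈ 1/(k+1)).

```python
import numpy as np

def landweber(K, Y, beta, n_iter):
    """Landweber iteration f_{k+1} = f_k + beta * K^T (Y - K f_k).

    Uses only products with K and K^T -- no SVD, no matrix inversion.
    The iteration count regularizes: alpha ~ 1/n_iter.
    """
    f = np.zeros(K.shape[1])
    for _ in range(n_iter):
        f += beta * (K.T @ (Y - K @ f))
    return f

# Toy problem: discretized integration operator plus white noise.
n = 200
K = np.tril(np.ones((n, n))) / n
x = np.linspace(0.0, 1.0, n)
f_true = np.sin(2 * np.pi * x)
rng = np.random.default_rng(0)
Y = K @ f_true + 1e-3 * rng.standard_normal(n)

beta = 1.0 / np.linalg.norm(K, 2) ** 2    # safe step length, beta <= 1/||K||^2
f_hat = landweber(K, Y, beta, n_iter=5000)
```

Stopping the iteration early regularizes: more iterations reduce the bias but amplify the noise, in line with the bias-variance analysis of section 4.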
In this paper we will show that general spectral regularization methods achieve the same rates of convergence of the MISE as spectral cut-off, which is known to be optimal in most cases (see above). Whereas the bias or approximation error is exactly the same in a deterministic and a statistical framework, the analysis differs significantly in the estimation of the noise term. In spectral cut-off for compact operators, the noise (or variance) part of the estimator f̂ belongs to a finite-dimensional space of low frequencies. The main difficulty in the analysis of general spectral regularization methods is the estimation of the high frequency components of the noise. Unlike in a deterministic framework, the bound on the noise term depends not only on the regularization parameter, but also on the distribution of the singular values of K (if K is compact). Therefore, a statistical analysis has to impose conditions on the operator. We will verify these conditions for several important problems including inverse problems in partial differential equations and errors in variable models. As an example of particular interest in the machine learning context we obtain optimal rates of convergence of L2-boosting by interpreting L2-boosting as a Landweber iteration (see also Bühlmann & Yu [6], Yao et al. [4]). The plan of this paper is as follows: In section 2 we introduce an abstract noise model and demonstrate how other commonly used statistical noise models fit into this general framework. Section 3 gives a brief overview of regularization methods and source conditions. The following section contains the main results of this paper on the rates of convergence of general spectral regularization methods. Finally, in section 5 we discuss the application of our results to the backwards heat equation, satellite gradiometry, errors in variable models with dependent random variables, L2-boosting, and operators in Hilbert scales. Proofs of section 4 are collected in section 6.

2. Noise model and spectral theorem.
2.1. Spectral theorem. Halmos' version of the spectral theorem (see, for instance, Halmos [8], Taylor [38]) turns out to be particularly convenient for the construction and statistical analysis of regularized inverses of a self-adjoint operator. This has been demonstrated by Mair & Ruymgaart [28] for the spectral cut-off estimator. The theorem states that for a (not necessarily bounded) self-adjoint operator A : D(A) → H defined on a dense subset D(A) of a separable Hilbert space H there exist a σ-compact space S, a Borel measure Σ on S, a unitary operator U : H → L²(Σ), and a measurable function ρ : S → R such that

    UAf = ρ · Uf,   Σ-almost everywhere,                              (2.1)

for all f ∈ D(A). Introducing the multiplication operator M_ρ : D(M_ρ) → L²(Σ), M_ρϕ := ρ · ϕ, defined on D(M_ρ) := {ϕ ∈ L²(Σ) : ρϕ ∈ L²(Σ)}, we can rewrite (2.1) as A = U* M_ρ U, i.e. A is unitarily equivalent to a multiplication operator. The essential range of ρ is the spectrum σ(A) of A. If A is bounded and positive definite as in sections 3-4, then 0 < ρ ≤ ‖A‖, Σ-a.e.

Remark 1. In the special case that A is compact, a well-known version of the spectral theorem states that A has a complete orthonormal system of eigenvectors u_i with corresponding eigenvalues ρ_i, and Af = ∑_{j=1}^∞ ρ_j ⟨u_j, f⟩ u_j. This can be rewritten in the multiplicative form (2.1) by choosing Σ as the counting measure on S = N, i.e. L²(Σ) = l²(N), the multiplier function as ρ(i) = ρ_i, i ∈ N, and defining the unitary operator U : H → l²(N) by (Uf)(i) := ⟨f, u_i⟩, i ∈ N.

2.2. Noise models. Now we will discuss several noise models commonly encountered in statistical modelling. First we introduce the general framework which will be used in the proof of our main result. Thereafter several examples will be discussed. Following Mathé & Pereverzev [29] we assume that our given data can be written as

    Y = g + σε + τξ,   g := Kf,                                       (2.2)

where ξ ∈ H_2, ‖ξ‖ ≤ 1, is a deterministic error, ε is a stochastic error, and τ, σ > 0 are the corresponding noise levels. Note that model (2.2) allows for stochastic and deterministic noise simultaneously. Often, the stochastic error will be a Hilbert-space valued random variable Ξ, i.e. a measurable function Ξ : Ω → H_2, where (Ω, 𝒜, P) is the underlying probability space. However, we will assume more generally that it is a Hilbert-space process, i.e. a continuous linear operator ε : H_2 → L²(Ω, 𝒜, P). Every Hilbert-space valued random variable Ξ with finite second moments, E‖Ξ‖² < ∞, can be identified with a Hilbert-space process ϕ ↦ ⟨Ξ, ϕ⟩, ϕ ∈ H_2, but not vice versa. We will use the notation ⟨ε, ϕ⟩ := εϕ, ϕ ∈ H_2.
The covariance Cov_ε : H_2 → H_2 of a Hilbert-space process ε : H_2 → L²(Ω, 𝒜, P) is the bounded linear operator defined implicitly by ⟨Cov_ε ϕ_1, ϕ_2⟩ = Cov(⟨ε, ϕ_1⟩, ⟨ε, ϕ_2⟩), ϕ_1, ϕ_2 ∈ H_2. We assume that

    for all ϕ ∈ H_2 :  E⟨ε, ϕ⟩ = 0,   ‖Cov_ε‖ ≤ 1.                    (2.3)

We call ε a white noise process if Cov_ε = I. Note that a Gaussian white noise process in an infinite-dimensional Hilbert space cannot be identified with a Hilbert-space valued random variable. If ξ : H_2 → L²(Ω, 𝒜, P) is a Hilbert-space process and A : H_2 → H_1 is a bounded linear operator, we define the Hilbert-space process Aξ : H_1 → L²(Ω, 𝒜, P) by ⟨Aξ, ϕ⟩ := ⟨ξ, A*ϕ⟩, ϕ ∈ H_1. Its covariance operator is given by Cov_{Aξ} = A Cov_ξ A*.

Assumption 1. For the noise model (2.2) the assumptions (2.3) and ‖ξ‖ ≤ 1 hold true. Moreover, K*ε is a Hilbert-space valued random variable satisfying

    E‖K*ε‖² < ∞,                                                      (2.4)

and there exists a spectral decomposition (2.1) of K*K such that for almost all s ∈ S

    Var((UK*ε)(s)) ≤ ρ(s).                                            (2.5)

In the remainder of this section we show that several commonly used noise models fit into the general framework (2.2). We start with an (infinite dimensional) white noise model, and then continue with several models based on finitely many observations.

2.2.1. White noise. A frequently used model is to assume that ε in (2.2) is a white noise process in H_2 (see e.g. Donoho [, ], Mathé & Pereverzev [29]). Moreover, we assume that K*K is a trace-class operator, i.e. it is compact and the eigenvalues ρ_j of K*K satisfy tr(K*K) := ∑_{j=1}^∞ ρ_j < ∞. Then Cov_{K*ε} = K*K, so E‖K*ε‖² = tr(Cov_{K*ε}) = tr(K*K) < ∞.

Therefore, K*ε can be identified with a Hilbert-space valued random variable. Using the notation introduced in Remark 1 and defining e_j : N → R by e_j(k) := δ_jk, u_j = U*e_j ∈ H_1 is a unit-length eigenvector of K*K to the eigenvalue ρ_j, and

    Var((UK*ε)(j)) = Var(⟨UK*ε, e_j⟩) = Var(⟨ε, Ku_j⟩) = ‖Ku_j‖² = ρ_j,

for j = 1, 2, .... Therefore, (2.5) is satisfied with equality.

2.2.2. Quasi-deconvolution, errors in variable, noncompact operators. Suppose we want to estimate the density f of a random variable Y with values in R^d, but we can only observe a random variable X = Y + W perturbed by a random variable W. Hence, our data are

    X_1, ..., X_n  i.i.d. ∼ X = Y + W.                                (2.6)

The density g of X is given by

    g = ∫_{R^d} h(· − y | y) f(y) dy =: Kf,                           (2.7)

where h(· | y) is the conditional density of W given Y = y. If Y and W are stochastically independent, K is a convolution operator. Recovering f is known as a deconvolution problem and has been studied extensively (e.g. Stefanski & Carroll [36], Fan [5] and Diggle & Hall [9]). Dependent Y and W in (2.6) occur in many scientific applications, e.g. brightness determination of extragalactic star clusters in astrophysics, where the variance σ² of the noise W increases monotonically with decreasing brightness of the object Y. Here, a reasonable model is described by h(z | y) = (2πσ²(y))^{−1/2} exp(−z²/(2σ²(y))) (see Bissantz [3]). We assume that f ∈ L²(R^d) and that K is a bounded, injective operator in L²(R^d). As opposed to the previous section, in general, K is not compact here. Obviously, an unbiased estimator of q := K*g is given by

    q̂_n(z) := (1/n) ∑_{j=1}^n h(X_j − z | z).                         (2.8)

To fit this into our general framework, we show that q̂_n = q + K*ε for a Hilbert-space process ε : L²(R^d) → L²(Ω, 𝒜, P) defined by

    ⟨ε, ϕ⟩ := (1/n) ∑_{j=1}^n ϕ(X_j) − ⟨g, ϕ⟩.                        (2.9)

In fact, for ψ ∈ L²(R^d),

    ⟨K*ε, ψ⟩ = ⟨ε, Kψ⟩ = (1/n) ∑_{j=1}^n ∫_{R^d} h(X_j − z | z) ψ(z) dz − ⟨K*g, ψ⟩ = ⟨q̂_n − q, ψ⟩.

The next result states that Assumption 1 is satisfied:

Proposition 2.
Assume that the operator K defined by (2.7) is injective and satisfies ‖K‖_{2,2} < ∞ and ‖K‖_{2,∞} < ∞, where ‖K‖_{r,s} is defined as the operator norm of K : L^r(R^d) → L^s(R^d). Moreover, let q̂_n and ε be defined by (2.8) and (2.9), and let

    σ := n^{−1/2} (‖g‖_{L^∞} + ‖g‖²_{L²})^{1/2}   and   ε̄ := ε/σ.    (2.10)

Then ε̄ satisfies Assumption 1, and q̂_n = q + σK*ε̄.

Proof. We have to show that (2.3)-(2.5) hold true. Since the X_j are assumed to be independent, it suffices to consider the case n = 1. The first part of (2.3), i.e. E⟨ε, ϕ⟩ = 0 for ϕ ∈ L²(R^d), follows from Eϕ(X) = ∫ ϕ g dx. Since

    Cov(⟨ε, ϕ_1⟩, ⟨ε, ϕ_2⟩) = ∫_{R^d} ϕ_1 ϕ_2 g dx − ⟨g, ϕ_1⟩⟨g, ϕ_2⟩

for all ϕ_1, ϕ_2 ∈ H_2,

the covariance operator of ε is given by Cov_ε = M_g − g ⊗ g, where M_g means multiplication by g, and g ⊗ g : L²(R^d) → L²(R^d) is the rank-1 operator defined by (g ⊗ g)ϕ := g⟨ϕ, g⟩. Now ‖Cov_ε̄‖ ≤ 1 follows from the estimate ‖Cov_ε‖ ≤ ‖g‖_{L^∞} + ‖g‖²_{L²}, which completes the proof of (2.3). To show (2.4), i.e. E‖q̂_n − q‖² < ∞, note that Cov_{q̂_1} = K* Cov_ε K = K* M_g K − (K*g) ⊗ (K*g). We have to show that this is a trace class operator. Obviously (K*g) ⊗ (K*g) is trace class as a rank-1 operator. It is not obvious, however, that K* M_g K is trace class, since neither K nor M_g are even compact in general. To show this, we rewrite the kernel of K as k(x, y) := h(x − y | y) and note that ess sup_x ‖k(x, ·)‖_{L²} = ‖K‖_{2,∞} < ∞. Since g ≥ 0, the operator K* M_g K is self-adjoint and positive semi-definite. Let {ϕ_j : j ∈ N} be a complete orthonormal system in the separable Hilbert space L²(R^d). The Beppo Levi theorem yields

    ∑_{j∈N} ⟨ϕ_j, K* M_g K ϕ_j⟩ = ∑_{j∈N} ∫ g(x) |(Kϕ_j)(x)|² dx
        = ∫ g(x) ∑_{j∈N} |⟨k(x, ·), ϕ_j⟩|² dx ≤ ‖g‖_{L^1} ess sup_x ‖k(x, ·)‖²_{L²} < ∞,

which implies that K* M_g K is trace class with tr(K* M_g K) ≤ ‖K‖²_{2,∞}. Finally, (2.5) follows from Lemma 3. □

If K is a convolution operator with convolution kernel w(x − y), then the canonical choice of the unitary operator U in the Halmos decomposition is the Fourier transform

    (Uϕ)(ξ) = (Fϕ)(ξ) = ∫_{R^d} ϕ(x) e^{−2πi ξ·x} dx,                (2.11)

and the multiplier function is then ρ = |Fw|². In this case the condition (2.5) in Assumption 1 can be verified explicitly; see Mair & Ruymgaart [28]. However, for a general integral operator K as in (2.7) we know neither U nor ρ explicitly. Therefore, in the proof of Proposition 2 we rely on the following lemma to show condition (2.5). It implies that (2.5) is a condition on the choice of U in the Halmos representation (2.1) rather than a condition on the noise model. The assertion that ρ ∈ L¹(Σ) will be required in section 4.

Lemma 3. If ε is a Hilbert space process satisfying (2.3), K*ε is a Hilbert space valued random variable satisfying (2.4), and K is injective, then there exists a spectral decomposition (2.1)
of K*K such that (2.5) holds true, and ρ ∈ L¹(Σ).

Proof. According to Halmos' spectral theorem there exist a Borel measure Σ̃ on a σ-compact space S and a unitary operator Ũ : L²(R^d) → L²(Σ̃) such that K*K = Ũ* M_ρ Ũ. For any Σ̃-measurable function χ > 0 on S we can construct another Halmos representation of K*K by introducing the Borel measure Σ := χ Σ̃ on S and the mapping U : L²(R^d) → L²(Σ), Uf := χ^{−1/2} Ũf, since U is unitary and UK*Kf = ρ · Uf Σ-a.e. for all f ∈ H_1. In particular, we may define

    χ(s) := Var((ŨK*ε)(s)) / ρ(s)  for s ∈ M,   M := {s ∈ S : Var((ŨK*ε)(s)) > 0}.   (2.12)

Here we use that ρ > 0 Σ̃-a.e. since K, and hence K*K, is injective by assumption. We first consider the case Σ̃(M^c) = 0, where M^c := S \ M. Then (2.5) holds true for s ∈ M as Var((UK*ε)(s)) = χ(s)^{−1} Var((ŨK*ε)(s)) = ρ(s). Moreover,

    ∫ ρ dΣ = ∫ Var(UK*ε) dΣ = E ∫ |UK*ε|² dΣ = E‖K*ε‖² < ∞,          (2.13)

which is the assertion. Now assume that Σ̃(M^c) > 0. Let ψ be an arbitrary strictly positive function in L¹(Σ̃), e.g. ψ(s) := j(s)^{−2} Σ̃(A_{j(s)})^{−1}, where j(s) := min{j : s ∈ A_j} for a sequence

A_1 ⊂ A_2 ⊂ ... ⊂ S with Σ̃(A_j) < ∞ and Σ̃(S \ ∪_j A_j) = 0. Such a sequence exists because Σ̃ is σ-finite. We define χ(s) by (2.12) for s ∈ M and χ(s) := ψ(s)/ρ(s) for s ∈ M^c. Then (2.5) is trivially satisfied for s ∈ M^c, and ρ ∈ L¹(Σ) since ∫_M ρ dΣ < ∞ as in (2.13) and ∫_{M^c} ρ dΣ ≤ ∫ ψ dΣ̃ < ∞. This finishes the proof. □

2.2.3. Inverse regression. We now review another commonly used noise model (see Wahba [39], O'Sullivan [35], Nychka & Cox [34], Bissantz et al. [4]) and show how it is related to the model (2.2). Suppose that H_i = L²(µ_i) are L²-spaces with respect to measure spaces (X_i, 𝒳_i, µ_i), i = 1, 2, H_1 is separable, and that K : L²(µ_1) → L²(µ_2) is an integral operator

    (Kf)(x) := ∫_{X_1} k(x, y) f(y) dµ_1(y),   x ∈ X_2,               (2.14)

with kernel k. Recall that K*K is trace class if and only if K is Hilbert-Schmidt, and that K is a Hilbert-Schmidt operator if and only if k ∈ L²(µ_2 ⊗ µ_1) (see Taylor [37]). The latter condition is easy to verify in most applications. We will assume in the following that the measure space (X_2, 𝒳_2, µ_2) is finite. Then we can arrange that µ_2(X_2) = 1. We consider the regression model

    Y_i = (Kf)(X_i) + ε_i,   f ∈ H_1,   i = 1, ..., n,                (2.15)

where we assume for simplicity that the random variables X_i ∈ X_2 have uniform distribution on X_2 (see also Remark 5). Moreover, we assume that (Y_i, X_i) ∼ (Y, X), i = 1, ..., n, are i.i.d. random variables with values in R × X_2 such that

    E[Y | X] = (Kf)(X),                                               (2.16)

and hence E[ε | X] = 0 for ε := Y − (Kf)(X). Finally we assume that v(X) := E[ε² | X] satisfies

    0 < C_{v,l} ≤ v(X) ≤ C_{v,u} < ∞   a.s.,                          (2.17)

for some constants C_{v,l}, C_{v,u} > 0. A straightforward computation shows that

    q̂_n = (1/n) ∑_{i=1}^n Y_i k(X_i, ·)                               (2.18)

is an unbiased estimator of the vector q := K*Kf. To fit the inverse regression model with random design into our general framework, we introduce the Hilbert-space (noise) process ε : H_2 → L²(Ω, 𝒜, P) by

    ⟨ε, ϕ⟩ := (1/n) ∑_{j=1}^n Y_j ϕ(X_j) − ⟨g, ϕ⟩,   ϕ ∈ H_2,         (2.19)

and show that

    ⟨K*ε, ψ⟩ = ⟨ε, Kψ⟩ = (1/n) ∑_{j=1}^n Y_j ∫_{X_1} k(X_j, y) ψ(y) dµ_1(y) − ⟨K*g, ψ⟩ = ⟨q̂_n − q, ψ⟩

for all ψ ∈ H_1, i.e. q̂_n = q + K*ε.

Proposition 4.
Assume the inverse regression model (2.14)-(2.17), and let q̂_n and ε be defined by (2.18) and (2.19). Moreover, let K : L²(µ_1) → L²(µ_2) be Hilbert-Schmidt, and µ_2-ess sup_x ‖k(x, ·)‖_{L²(µ_1)} < ∞. Define

    σ := (C/n)^{1/2}   and   ε̄ := ε/σ,

with C := C_{v,u} + ‖g‖²_{L^∞(µ_2)} + 2‖g‖²_{L²(µ_2)}. Then ε̄ satisfies Assumption 1 for the unitary transform U defined in Remark 1, and q̂_n = q + σK*ε̄. Moreover,

    (C_{v,l}/n) ρ(j) ≤ Var((U q̂_n)(j)),   j = 1, 2, ....              (2.20)

Proof. It suffices to prove this for n = 1. Since X is uniformly distributed and (2.16) holds true, we have

    E(Y ϕ(X)) = E(E[ε | X] ϕ(X)) + E(g(X) ϕ(X)) = ∫ g ϕ dµ_2 = ⟨g, ϕ⟩

for all ϕ ∈ H_2, and hence the first part of eq. (2.3) holds true. Using once more the same properties of X and Y we find that

    Cov(⟨ε, ϕ_1⟩, ⟨ε, ϕ_2⟩) = E{Y² ϕ_1(X) ϕ_2(X)} − ⟨g, ϕ_1⟩⟨g, ϕ_2⟩
        = E{(ε² + 2εg(X) + g(X)²) ϕ_1(X) ϕ_2(X)} − ⟨g, ϕ_1⟩⟨g, ϕ_2⟩
        = ∫ ϕ_1 (v + g²) ϕ_2 dµ_2 − ⟨g, ϕ_1⟩⟨g, ϕ_2⟩

for all ϕ_1, ϕ_2 ∈ H_2. Hence, Cov_ε = M_{v+g²} − g ⊗ g, where M_{v+g²}ϕ := (v + g²)ϕ and (g ⊗ g)ϕ := ⟨g, ϕ⟩g. This implies ‖Cov_ε‖ ≤ C and finishes the proof of (2.3). Using the notation of Remark 1, condition (2.4) can be seen as follows:

    E‖q̂_1 − q‖² = tr Cov(q̂_1 − q) = ∑_{j=1}^∞ ⟨Ku_j, Cov_ε Ku_j⟩ ≤ C ∑_{j=1}^∞ ‖Ku_j‖² = C tr(K*K) < ∞.

Since

    Var((U q̂_1)(j)) = ⟨u_j, Cov_{q̂_1} u_j⟩ = ⟨Ku_j, Cov_ε Ku_j⟩ ≤ C‖Ku_j‖² = C ρ_j,

we obtain the bound (2.5). The lower bound in (2.20) holds true since the operator M_{g²} − g ⊗ g is positive semi-definite as the covariance operator of ε for the case ε ≡ 0. □

As opposed to [28] we do not need the assumption that the singular vectors u_j ∈ H_1 and v_j ∈ H_2 in the singular value decomposition Kf = ∑_{j=1}^∞ √ρ_j ⟨f, u_j⟩_{H_1} v_j be uniformly bounded sequences in L^∞(µ_1) and L^∞(µ_2), respectively. We only require that µ_2-ess sup_x ‖k(x, ·)‖_{L²(µ_1)} < ∞. This condition is often less restrictive and easier to verify.

Remark 5 (Generalizations). 1. We can replace L²(µ_1) by an arbitrary Hilbert space H_1 (e.g. a Sobolev space) by replacing k(x, ·) by k(x) := ∑_{j=1}^∞ √ρ_j v_j(x) u_j, x ∈ X_2. Then (2.14) and (2.18) read (Kf)(x) = ⟨k(x), f⟩_{H_1} and q̂_n = (1/n) ∑_{i=1}^n Y_i k(X_i), respectively. Proposition 4 remains valid with literally the same proof if L²(X_1) is replaced by H_1.

2. (Deterministic and nonuniform design.)
The noise model (2.2) also allows to treat models of the form (2.15) where the measurement points are either nonuniformly distributed on X_2 or x_i = x_i^{(n)} are deterministic quantities (see, for instance, Nychka & Cox [34], O'Sullivan [35]). For conditions on the design density see Munk [32].

3. Regularization methods. We now review some basic notions of regularization theory.

3.1. Regularized estimators. Recall Halmos' spectral theorem from section 2.1. For a self-adjoint operator A : D(A) → H and a bounded, measurable function Φ : σ(A) → R one defines an operator Φ(A) ∈ L(H) by

    Φ(A) = U* M_{Φ(ρ)} U,                                             (3.1)

(see e.g. Taylor [38]). The mapping Φ ↦ Φ(A), called the functional calculus at A, is an algebra homomorphism from the algebra of bounded measurable functions on σ(A) to the algebra L(H) of bounded linear operators on H, and

    ‖Φ(A)‖ ≤ sup_{λ ∈ σ(A)} |Φ(λ)|,                                   (3.2)

with equality if Φ is continuous. We will construct estimators of the input function by regularization methods of the form

    f̂_{α,σ} = Φ_α(K*K) K* Y.                                          (3.3)

Here (Φ_α)_{α>0} is a collection of bounded filter functions Φ_α : σ(K*K) → R approximating the unbounded function t ↦ 1/t on σ(K*K), parametrized by a regularization parameter α > 0. A particular example of a regularization method of the form (3.3) is the spectral cut-off estimator (also known as truncated singular value decomposition) described by the functions

    Φ_α^SC(t) := 1/t for t ≥ α,   Φ_α^SC(t) := 0 for t < α.

As explained in the introduction, we will focus on regularization methods which can be implemented without explicit knowledge of the spectral decomposition of the operator K*K. This includes both implicit methods such as Tikhonov regularization (Φ_α(t) = 1/(α + t)), iterated Tikhonov regularization and Lardy's method, which involve the inversion of an operator, and explicit methods such as Landweber iteration (Φ_{1/(k+1)}(t) = β ∑_{j=0}^k (1 − βt)^j, where β ∈ (0, ‖K‖^{−2}] is a step-length parameter) and ν-methods, which require only matrix-vector products in a discrete setting. For a derivation and discussion of these methods we refer to the monograph [3].

3.2. Smoothness classes. We will measure the smoothness of the input function f relative to the smoothing properties of K in terms of source conditions: Let Λ : [0, ∞) → [0, ∞) be a continuous, strictly increasing function with Λ(0) = 0, and assume that there exists a source w ∈ H_1 such that

    f = Λ(K*K) w.                                                     (3.4)

The set of all f satisfying this condition with ‖w‖ ≤ w̄, w̄ > 0, will be denoted by

    F_{Λ,w̄,K*K} := {Λ(K*K)w : w ∈ H_1, ‖w‖ ≤ w̄}.

We will shortly write F_{Λ,w̄} := F_{Λ,w̄,K*K} if there is no ambiguity. The most common choice, which is usually appropriate for finitely smoothing operators K, is

    Λ(t) = t^ν,   ν > 0.
(3.5)

In particular, (3.4) with Λ(t) = t^{1/2} is equivalent to f = K*v, v ∈ H_2 (see Engl et al. [3, Prop. 2.8]). For exponentially ill-posed problems such as the backwards heat equation, (3.5) is usually too restrictive, and logarithmic source conditions corresponding to the choice

    Λ(t) = (−ln t)^{−p},   p > 0,                                     (3.6)

are more appropriate (see Hohage [2], Mair [27]). Since Λ is singular at t = 1, we assume that the norms in H_1 and H_2 are scaled such that ‖K*K‖ < 1 in this case. For a further discussion of source conditions and interpretations as smoothness conditions in Sobolev spaces we refer to the applications in section 5. If f belongs to the smoothness class F_{Λ,w̄} and we are given exact data Y = g, then the error is bounded by

    ‖Φ_α(K*K)K*g − f‖ = ‖(Φ_α(K*K)K*K − I)Λ(K*K)w‖ ≤ sup_{t ∈ σ(K*K)} |(Φ_α(t)t − 1)Λ(t)| ‖w‖,   (3.7)

where we have used (3.2).
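The approximation-error bound (3.7) can be explored numerically. The sketch below is our own illustration, not from the paper: a grid stands in for σ(K*K), scaled so that ‖K*K‖ ≤ 1, and we evaluate the residual factor 1 − tΦ_α(t) for the spectral cut-off, Tikhonov, and Landweber filters under the Hölder source function Λ(t) = t.

```python
import numpy as np

alpha = 1e-2
t = np.linspace(1e-6, 1.0, 100_000)   # stands in for sigma(K*K), ||K*K|| <= 1

phi_sc = np.where(t >= alpha, 1.0 / t, 0.0)        # spectral cut-off
phi_tik = 1.0 / (alpha + t)                        # Tikhonov
k = int(1 / alpha) - 1                             # Landweber with alpha = 1/(k+1)
beta = 1.0
phi_lw = (1.0 - (1.0 - beta * t) ** (k + 1)) / t   # closed form of beta*sum_j (1-beta*t)^j

# The residual factor 1 - t*phi_alpha(t) governs the approximation error (3.7).
# For Lambda(t) = t the Hoelder-type bound reads
#     sup_t |t * (1 - t*phi_alpha(t))| <= gamma_1 * alpha,
# i.e. all three filters have qualification at least 1 (here with gamma_1 = 1).
for name, phi in [("cutoff", phi_sc), ("tikhonov", phi_tik), ("landweber", phi_lw)]:
    print(name, np.max(np.abs(t * (1.0 - t * phi))) / alpha)
```

The printed ratios all stay below 1; for Tikhonov the bound is nearly attained, reflecting its qualification 1, while Landweber's residual is markedly smaller.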

3.3. Assumptions. In the following we discuss a number of standard assumptions on the filter functions Φ_α, which are satisfied for all commonly used regularization methods. First, we assume that there exists a constant C_2 > 0 such that

    sup_{t ∈ σ(K*K)} |tΦ_α(t)| ≤ C_2,   uniformly in α > 0.           (3.8a)

To bound the so-called propagated deterministic noise error τ‖Φ_α(K*K)K*ξ‖, we impose the condition

    there exists C_3 > 0 :  sup_{α > 0} sup_{t ∈ σ(K*K)} α|Φ_α(t)| ≤ C_3.   (3.8b)

In view of the bound (3.7) on the approximation error, we also assume that there exist a number ν_0 > 0, called the qualification of the method, and constants γ_ν > 0 such that

    sup_{t ∈ σ(K*K)} |t^ν (1 − tΦ_α(t))| ≤ γ_ν α^ν,   for all α > 0 and all 0 < ν ≤ ν_0.   (3.8c)

The qualification of a method is a measure of the maximal degree of smoothness in terms of the Hölder-type conditions (3.4), (3.5) under which the approximation error (3.7) converges of optimal order. The qualification of some commonly used methods is as follows: Tikhonov regularization: 1; K-times iterated Tikhonov regularization: K; Landweber iteration: ∞ (in the sense that it is greater than any real number); ν-methods: ν (where ν > 0 is a parameter in the method); see references in the Introduction. Note that the condition (3.8c) with ν > 0 implies that lim_{α→0} Φ_α(t) = 1/t for all t ∈ σ(K*K). For ν = 0 the condition (3.8c) implies (3.8a) with C_2 = 1 + γ_0. However, this value of C_2 is usually not optimal, as for most regularization methods (3.8a) holds true with C_2 = 1. For general source conditions we assume that there exists a constant γ_Λ such that

    sup_{t ∈ σ(K*K)} |Λ(t)(1 − tΦ_α(t))| ≤ γ_Λ Λ(α),   α → 0.         (3.9)

Under Hölder-type source conditions (3.5) this holds true for ν ≤ ν_0 by assumption (3.8c). For the choice Λ(t) = (−ln t)^{−p}, it has been shown in Hohage [2] that (3.8c) with ν > 0 implies (3.9). For more general functions Λ we refer to Mathé & Pereverzev [3] for similar implications.

4. MISE estimates. In this section the main results of this paper are presented. Recall the definition of the estimator f̂_{α,σ} of the input function f in (3.3).
Since E Φ_α(K*K)K*ε = 0, the MISE satisfies the bias-variance decomposition

    E‖f̂_{α,σ} − f‖² = B(f̂_{α,σ})² + E‖f̂_{α,σ} − E f̂_{α,σ}‖²,          (4.1)

with the bias term B(f̂_{α,σ}) := ‖E f̂_{α,σ} − f‖.

4.1. Estimation of the bias. The bias in our model coincides with the error in a deterministic setting and can be estimated by standard techniques (see [3]). Using the triangle inequality, the noise model (2.2), (2.3), and the definition (3.3) of f̂_{α,σ}, we get

    B(f̂_{α,σ}) ≤ ‖Φ_α(K*K)K*Kf − f‖ + τ‖Φ_α(K*K)K*ξ‖.

The first term (called approximation error) is bounded by γ_Λ Λ(α)‖w‖ due to (3.7) and (3.9). For the second term (called propagated deterministic noise error) we obtain the bound

    ‖Φ_α(K*K)K*ξ‖² = ⟨Φ_α(KK*)ξ, KK*Φ_α(KK*)ξ⟩ ≤ C_2 C_3 / α,        (4.2)

using the identity Φ_α(K*K)K* = K*Φ_α(KK*) (see [3, eq. (2.43)]) and (3.8). Hence,

    B(f̂_{α,σ}) ≤ γ_Λ Λ(α)‖w‖ + (C_2 C_3 / α)^{1/2} τ.                 (4.3)
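The decomposition (4.1) can be checked by simulation in the eigenbasis of K*K, where everything diagonalizes. The following Monte-Carlo sketch is our own illustration (the eigenvalue sequence, source element, and noise level are assumed choices): it compares the empirical MISE of a Tikhonov estimator with the exact sum of squared bias and integrated variance.

```python
import numpy as np

# Diagonalized toy model: rho_j are the eigenvalues of K*K, and in the
# eigenbasis the estimator (3.3) acts componentwise. All choices below
# (spectrum, source element, noise level) are illustrative assumptions.
rng = np.random.default_rng(1)
m = 200
rho = 1.0 / np.arange(1, m + 1) ** 2          # mildly ill-posed spectrum
w = np.ones(m) / np.sqrt(m)                   # source element, ||w|| = 1
nu = 0.5
f = rho ** nu * w                             # Hoelder source f = (K*K)^nu w
sigma, alpha, n_mc = 1e-3, 1e-2, 2000

phi = 1.0 / (alpha + rho)                     # Tikhonov filter
# In the eigenbasis: f_hat_j = phi(rho_j) * (rho_j f_j + sigma sqrt(rho_j) xi_j),
# with xi white noise, matching Var((U K* eps)(j)) = rho_j from (2.5).
xi = rng.standard_normal((n_mc, m))
est = phi * (rho * f + sigma * np.sqrt(rho) * xi)
mise = np.mean(np.sum((est - f) ** 2, axis=1))   # empirical MISE
bias2 = np.sum((phi * rho * f - f) ** 2)         # exact squared bias
var = sigma ** 2 * np.sum(phi ** 2 * rho)        # exact integrated variance
print(mise, bias2 + var)
```

Up to Monte-Carlo error, `mise` agrees with `bias2 + var`, which is the content of (4.1).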

Since we aim to show optimality of general regularization methods by comparison to spectral cut-off (see Introduction and section 4.3), we now compare the approximation errors of general regularization methods and spectral cut-off. To this end, we introduce the following notations.

Notation: For two real-valued functions f, g defined on an interval (0, ᾱ], we write f(α) ∼ g(α) (or f(α) ≲ g(α)), α → 0, if f(α)/g(α) is well defined for α in some neighborhood of 0 and lim_{α→0} f(α)/g(α) = 1 (or lim sup_{α→0} f(α)/g(α) ≤ 1, respectively). Furthermore, we write f(α) ≍ g(α), α → 0, if there exist constants ᾱ > 0 and Cᾱ ≥ 1 such that (1/Cᾱ) f(α) ≤ g(α) ≤ Cᾱ f(α) for 0 < α ≤ ᾱ.

Recall that Λ : [0, ∞) → [0, ∞) is assumed to be a strictly increasing, continuous function with Λ(0) = 0 and that tΦ_α^SC(t) = χ_{[α,∞)}(t), i.e. I − K*KΦ_α^SC(K*K) is an orthogonal projection operator. Therefore,

    sup_{f ∈ F_{Λ,w̄}} ‖(I − K*KΦ_α^SC(K*K))f‖ = sup_{t ∈ σ(K*K)} |(1 − tΦ_α^SC(t))Λ(t)| w̄ ∼ Λ(α) w̄,   α → 0.

The last relation holds since 0 is not an isolated point of the spectrum σ(K*K) for ill-posed operator equations. Using (3.7) and (3.9) we obtain the estimate

    sup_{f ∈ F_{Λ,w̄}} ‖(I − K*KΦ_α(K*K))f‖ ≤ γ_Λ Λ(α) w̄ ∼ γ_Λ sup_{f ∈ F_{Λ,w̄}} ‖(I − K*KΦ_α^SC(K*K))f‖   (4.4)

as α → 0. For many regularization methods and smoothness classes we have γ_Λ = 1.

4.2. Estimation of the integrated variance and rate of convergence of the MISE. The more difficult part is the estimation of the integrated variance of the error f̂_{α,σ} − f. Under Assumption 1 we have

    E‖f̂_{α,σ} − E f̂_{α,σ}‖² = σ² E‖Φ_α(ρ) UK*ε‖² ≤ σ² ∫ Φ_α(ρ)² ρ dΣ.   (4.5)

A crucial point in the following analysis is the estimation of the tails of the spectral function ρ. To this end, we bound the variance in terms of the function

    R(α) := Σ({ρ ≥ α}),   α > 0.                                      (4.6)

In order to control the MISE of f̂_{α,σ} as α → 0, it is tempting to assume that R is smooth in a neighborhood around 0. However, this is not true in general. Therefore, we will pose instead that R can be approximated suitably by a smooth function S with similar properties as R, as α → 0. Obviously, R is monotonically decreasing (see (4.8a) below).
If ρ belongs to L¹(Σ), then −∫_0^ᾱ β dR(β) ≤ ∫_S ρ dΣ < ∞ (see (4.8b)), and it follows from Lebesgue's dominated convergence theorem that lim_{α→0} αR(α) = lim_{α→0} ∫_S α 1_{{ρ≥α}} dΣ = 0 (see (4.8c)).

Assumption 2. There exist a constant ᾱ ∈ (0, ‖ρ‖_∞] and a function S ∈ C²((0, ᾱ]) such that

    R(α) ∼ S(α),   α → 0,                                             (4.7)

with R defined by (4.6) in terms of the spectral decomposition (2.1), and S satisfies

    S' ≤ 0 < S on (0, ᾱ],                                             (4.8a)
    β ↦ βS'(β) is integrable on (0, ᾱ],                               (4.8b)
    lim_{α→0} αS(α) = 0,                                              (4.8c)
    there exists γ_S ∈ (0, 2) such that −αS'(α) ≤ γ_S S(α) for all α ∈ (0, ᾱ].   (4.8d)

We will show in section 5 for a number of examples that this assumption is satisfied. Now we are in the position to give an estimate of the MISE. The estimate of the MISE in the image space H_2 in (4.10) is needed in the analysis of L2-boosting (section 5.4) and for nonlinear inverse problems.

Theorem 6. Consider the model (2.2), and let Assumptions 1 and 2 hold true. We define a general spectral estimator f̂_{α,σ} by (3.3) and assume that Φ_α satisfies (3.8).

1. If condition (3.9) is satisfied for the function Λ defining the smoothness class F_{Λ,w̄,K*K}, then for all f ∈ F_{Λ,w̄,K*K} the MISE can be asymptotically bounded by

    E‖f̂_{α,σ} − f‖²_{H_1} ≲ (γ_Λ Λ(α) w̄ + (C_2 C_3/α)^{1/2} τ)² + (C_2² + C_3²) σ² α^{−2} ∫_0^α S(β) dβ,   α → 0.   (4.9)

2. Assume that g ∈ F_{Λ̃,w̄,KK*} ⊂ H_2 and that Λ̃ satisfies (3.9). (If g = Kf with f ∈ F_{Λ,w̄,K*K}, then Λ̃(t) := t^{1/2}Λ(t); but we do not assume g ∈ R(K) here!) Then

    E‖Kf̂_{α,σ} − g‖²_{H_2} ≲ (γ_Λ̃ Λ̃(α) w̄ + C_2 τ)² + (C_2² + C_3²) σ² α^{−2} ∫_0^α β S(β) dβ,   α → 0.   (4.10)

Note that for statistical inverse problems, as opposed to deterministic inverse problems, the estimates of the noise term and hence the rates of convergence of the MISE depend not only on the relative smoothness of the solution (i.e. on Λ), but also on the operator (i.e. on S).

Remark 7. We comment on the choice of the regularization parameter α > 0. If the noise levels σ and τ, the spectral properties of K*K (i.e. S) and the smoothness of f (i.e. Λ) are known, one can choose α by minimizing the right-hand side of (4.9). Since typically the smoothness of the solution is not known a priori, so-called adaptive methods must be employed for the selection of α. We do not intend to review the considerable amount of literature on this topic here, but want to mention that the explicit bounds on the variance given in Theorem 6 allow the application of the Lepskij balancing principle as proposed for inverse problems by Mathé & Pereverzev [3, 3] and Bauer & Pereverzev [2]. We will discuss this in more detail elsewhere. With this method one typically loses a log factor in the asymptotic rates of convergence.
In most cases this can be avoided by using Akaike s method as studied for spectral cut-off and related methods by Cavalier et al. [7]. Unfortunately, Assumption 2 in this paper is not satisfied for the methods discussed here Comparison with spectral cut-off. To show that with an optimal choice of our estimators can achieve the best possible order of convergence among all estimators as σ, we compare them to the spectral cut-off estimator for which minimax results are known in many situations see references in the introduction). Since we are mainly interested in the case that the statistical noise is asymptotically dominant, we will assume that τ = for simplicity. Moreover, we assume in addition to 2.5) that the lower bound Var UK εs)) γ var ρs). 4.) holds true for some constant γ var >. For the white noise model this is satisfied with γ var = and for the inverse regression model with γ var = C v,l /C see 2.2)). Moreover, we need the following assumption to prove optimal rates in many mildly ill-posed problems. Assumption 3. There exists a constant C 4 > such that for all, ᾱ] C 4 S ) S). 4.2) Theorem 8. Let Assumptions and 2 and the lower bound 4.) hold true and assume that the family of functions {Φ } satisfies 3.8). Moreover, assume that either S = R, or Assumption 3 holds true. Then the integrated variance of the estimator ˆf,σ is bounded by the integrated variance of the spectral cut-off estimator ˆf SC,σ E ˆf,σ E ˆf,σ 2 < C2 2 + κc 2 3 γ var SC SC E ˆf,σ E ˆf,σ 2,, 4.3)

with C₂ and C₃ as in Theorem 6 and κ := γ_S/(2 − γ_S), where γ_S is defined in (4.8d). Moreover, if condition (3.9) is satisfied for the function Λ defining the smoothness class F_{Λ,w}, and if τ = 0, then there exists a constant C > 0 such that

    sup_{f ∈ F_{Λ,w}} E‖f̂_{α,σ} − f‖² ≤ C sup_{f ∈ F_{Λ,w}} E‖f̂^SC_{α,σ} − f‖²    (4.14)

for all σ > 0 and all α > 0 sufficiently small.

Whereas condition (4.12) is usually satisfied for mildly ill-posed problems, it is not satisfied for exponentially ill-posed problems, where S(α) ~ c (ln(1/α))^q for constants c, q > 0. Nevertheless, the error bounds in Theorem 6 yield optimal rates of convergence in the limit σ → 0 for logarithmic source conditions after taking the infimum over all α. This is made precise in the following result, which relies on a comparison of the rates for general regularization methods with bounds on the spectral cut-off rates; the latter are known to be optimal in many situations (e.g. Mair & Ruymgaart [28]).

Theorem 9. Under the assumptions of Theorem 6, Part 1, with τ = 0, define the functions

    γ₁(α) := ∫_α^ᾱ β⁻¹ dR(β)   and   γ₂(α) := (1/α²) ∫_0^α S(β) dβ,

and assume that

    Λ( γ₂⁻¹(γ₁(α)) ) ≤ C Λ(α),   0 < α ≤ ᾱ,    (4.15)

with the inverse function γ₂⁻¹ of γ₂ and a constant C > 0. Then

    inf_{α>0} E‖f̂_{α,σ} − f‖² ≲ inf_{α>0} { C (γ Λ(α) w)² + (C₂² + C₃²) σ² γ₁(α) },   σ → 0,

i.e. if we choose the optimal value of α for every noise level σ, all spectral regularization methods achieve the same rate of convergence of the MISE as spectral cut-off. Assumption (4.15) is satisfied if Λ(t) = (−ln t)^{−p} and γ₁(α) ≥ γ₂(α²), since then Λ(γ₂⁻¹(γ₁(α))) ≤ Λ(α²) = (2 ln(1/α))^{−p} = 2^{−p} Λ(α).

5. Applications. In this section we discuss how Assumption 2 of our main result (Theorem 8) can be verified for some specific operators of practical interest. A remarkable number of interesting inverse problems can be expressed in the form

    K*K = Θ(−Δ)    (5.1)

in terms of the Laplace operator Δ on some compact, smooth d-dimensional Riemannian manifold M with a (possibly empty) boundary ∂M. Our first three examples are of this form. Here Θ : [0, ∞) → (0, ∞) is a function satisfying lim_{λ→∞} Θ(λ) = 0.

Under the given assumptions the operator −Δ defined on D(−Δ) := H¹₀(M) ∩ H²(M) ⊂ L²(M) (i.e. with a Dirichlet condition on ∂M) is a positive, self-adjoint operator which has a complete orthonormal system of eigenvectors u_i in L²(M) with corresponding eigenvalues λ_i (see e.g. Taylor [38, Chap. 8.2]). Hence the operator on the right hand side of (5.1), defined as in (3.1), can be written as Θ(−Δ)f = Σ_i Θ(λ_i) ⟨f, u_i⟩ u_i for f ∈ L²(M). Due to a famous result of Weyl (see Taylor [38, Ch. 8, Thm 3.1 and Cor. 3.5]), the distribution of the eigenvalues has the asymptotic behavior

    N(λ) := #{λ_i : λ_i ≤ λ} ~ c_M λ^{d/2} vol M,   c_M := 1 / ( Γ(d/2 + 1) (4π)^{d/2} ),    (5.2)

as λ → ∞, where vol M = ∫_M dx denotes the volume of M. Under the given assumptions the operator Θ(−Δ) is compact as the operator norm limit of the finite rank operators Σ_{i=1}^k Θ(λ_i) ⟨·, u_i⟩ u_i as k → ∞. Assume that Θ(λ) is monotonically decreasing for λ ≥ λ₀ and that Θ(λ) > α₀ := Θ(λ₀) for λ < λ₀. As lim_{λ→∞} Θ(λ) = 0, the inverse function Θ⁻¹ : (0, α₀] → [λ₀, ∞) satisfies lim_{α↘0} Θ⁻¹(α) = ∞. If the spectral decomposition is chosen as in Remark 1, then the function R defined in (4.6) satisfies

    R(α) = #{λ_i : Θ(λ_i) ≥ α} = N(Θ⁻¹(α)) ~ c_M (Θ⁻¹(α))^{d/2} vol M,   α ↘ 0.    (5.3)
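Weyl's law (5.2) can be made concrete in the simplest case M = (0, π) with d = 1, where the Dirichlet Laplacian has eigenvalues λ_j = j² and the counting function is N(λ) = ⌊√λ⌋, so that c_M λ^{1/2} vol M = (1/π)·√λ·π = √λ. A quick numerical confirmation:

```python
import math

# Weyl's law (5.2) in the simplest case: M = (0, pi), d = 1, where the
# Dirichlet Laplacian has eigenvalues lambda_j = j**2, j = 1, 2, ...
d, vol_M = 1, math.pi
c_M = 1.0 / (math.gamma(d / 2 + 1) * (4 * math.pi) ** (d / 2))  # = 1/pi

lam = 1.0e6
N_exact = sum(1 for j in range(1, 2000) if j * j <= lam)  # counting function
N_weyl = c_M * lam ** (d / 2) * vol_M                      # = sqrt(lam) here

print(N_exact, round(N_weyl))  # 1000 1000
```

The agreement is exact in this one-dimensional example; on a general manifold (5.2) holds only asymptotically as λ → ∞.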

5.1. Backwards heat equation. We consider the inverse problem of reconstructing the temperature at time t = 0 on M from measurements of the temperature at time t = T. The forward problem is described by the parabolic equation

    ∂u/∂t (x, t) = Δu(x, t),   x ∈ M, t ∈ (0, T),
    u(x, t) = 0,               x ∈ ∂M, t ∈ (0, T],
    u(x, 0) = f(x),            x ∈ M,                (5.4)

with an initial temperature f ∈ L²(M) and the final temperature g(x) := u(x, T), x ∈ M. We have g = exp(TΔ)f, i.e. K = exp(TΔ) ∈ L(L²(M)) and K*K = exp(2TΔ). Hence Θ(λ) = exp(−2Tλ) in (5.1). By virtue of (5.3) the condition R ≲ S is satisfied for

    S(α) := c_M vol M ( (1/(2T)) ln(1/α) )^{d/2}.

It is easy to check that this function satisfies the conditions (4.8). In particular,

    −α S''(α)/S'(α) = 1 + (d − 2)/(2 ln(1/α)),    (5.5)

so (4.8d) holds with any γ_S ∈ (1, 2) for sufficiently small ᾱ if d ≥ 3, and with γ_S = 1 for all ᾱ < 1 if d ≤ 2. If M is a compact Riemannian manifold without boundary, then the smoothness class F_{Λ,1} for a logarithmic source condition (3.6) is the unit ball in the Sobolev space H^{2p}(M) with respect to some equivalent norm (see Hohage [2]). Similar results hold true if M has a boundary with a Dirichlet or Neumann condition; in this case we additionally need to impose boundary conditions. Hence, if the initial temperature is bounded in some Sobolev norm, ‖f‖_{H^s} ≤ C, s = 2p > 0, if the regularization parameter is chosen such that α ~ σ, and if τ = O(σ^µ) with µ > 1/2 as σ → 0, then it follows from Theorem 6, after some elementary computations, that the MISE decays like

    E‖f̂_{α,σ} − f‖²_{L²} = O( (−ln σ)^{−s} ),   σ → 0,

for all regularization methods satisfying (3.8).

5.2. Satellite gradiometry. In satellite gradiometry, measurements of the gravitational force of the earth at a distance a > 1 from the center are used to reconstruct the gravitational potential u at the surface of the earth (see Hohage [2], Bauer & Pereverzev [2] and the references therein). Let the earth be described by E := {x : |x| < 1}, and let M := ∂E denote the surface of the earth. Then u satisfies the Laplace equation Δu(x) = 0 for x ∈ R³ \ E̅ and decays like u(x) = O(|x|⁻¹) as |x| → ∞.
The available data consist of noisy measurements of the rate of change of the gravitational force ∂u/∂r in the radial direction r = |x|, i.e. of g(x) := (∂²u/∂r²)(x) for |x| = a. A discussion of the measurement errors shows that they are mainly of a random nature (see [2]). The problem is to determine the potential f = u|_M at the surface M of the earth. Introducing the operator K : L²(M) → L²(aM) mapping the solution f to the data g, we can write K*K in the form (5.1) with Θ(λ) = Φ(Λ(λ)) and

    Φ(t) := c (1 + t)² (2 + t)² a^{−2t},   Λ(λ) := √(λ + 1/4) − 1/2

14 see Hohage [2]). It is easy to show that Φt) is decreasing for sufficiently large t and that Λλ) is monotonic increasing for all λ >. Obviously, Θ) = Λ Φ) ) = Φ) 2 2. The function Φ) cannot be computed explicitly, but we can estimate its asymptotic behavior as. Writing t = Φ) for sufficiently small and pt) := c 2 + t) t) 2 we obtain ) t Φ) = log a log a = log t a log a pt)a 2t ) = log t a ln log a pt)) + 2t 2 ln a, as. Therefore, using 5.3), we get R) = c M Θ) ln ) 2,. 2 ln a The function S) := ln 2 ln a) 2 satisfies the conditions 4.8) see 5.5)). Moreover, the smoothness classes F Λ, for logarithmic source conditions 3.6) are unit balls in the Sobolev spaces H p M) w.r.t. equivalent norms see Hohage [2]). Since the gravitational potential satisfies the Poisson equation u = φ in R 3 and since the mass density φ of the earth E belongs to L 2 E), it follows from elliptic regularity results that u H 2 E), so f = u M H 3/2 M) in the sense of the trace operator see e.g. Taylor [37]). Therefore, if τ = Oσ) and if we choose σ. E ˆf,σ f 2 L 2 = O ln σ) 3), σ, 5.3. Operators in Hilbert scales. In the following we show that our assumptions are satisfied for operators acting in Hilbert scales see Mair & Ruymgaart [28], Mathé & Pereverzev [29]). Hence, spectral regularization methods yield optimal rates of convergence for this class of operators. Let L : DL) H H be an unbounded, positive, self-adjoint operator defined on a dense domain DL) H, and assume the inverse L : H H is bounded. Then L generates a scale of Hilbert spaces H µ, µ R defined as completion of n N DLn ) under the norm generated by the inner product f, g µ := L µ f, L µ g. We have H µ H λ for µ, λ R with µ > λ. A prototype is L = I with the Laplace operator on a closed manifold M, which leads to the usual Sobolev spaces on M. We assume that K is a-times smoothing a > ) in part of) the Hilbert scale H µ ), i.e. K : H µ a H µ is a bounded operator for all µ [µ, µ] which has a bounded inverse K : H µ H µ a. 
This is equivalent to C µ f µ a Kf µ C µ f µ a for all f H µ a and some constants C µ. Such conditions are satisfied for many boundary integral operators, multiplication operators, convolution operators and compositions of such operators see also the discussion after 5.)). We do not assume here that K is self-adjoint or that K K and L commute, i.e. that they can be diagonalized by the same unitary operator U. Usually the nature of the noise dictates the choice H 2 = H, and one is interested in error bounds for the estimator in positive norm, i.e. H = H µ a for µ a. Then the operator equation Kf = g is ill-posed with K = K µ a considered as an operator from H µ a to H. To verify Assumptions 2 and 3 with R S in 4.7) replaced by R S see Remark 4), we assume that L has a complete orthonormal system of eigenvectors with eigenvalues < λ λ λ 2... tending to infinity. Then the embedding operator J : H µ H is compact, and its singular values are given by σ j J) = λ µ j. It follows from the decomposition K µ a = JK µ µ a and the Courant mini-max characterization of the singular values σ j = σ j K µ a ) see e.g. Kreß [26]) that K µ a µ λ µ j σ j K µ a ) K µ µ a λ µ j, j =,,... Hence if Nλ) := #{λ j : λ j λ} and C := max K µ µ a, K µ a µ ), then R) := {σ j : σj 2 } satisfies N /C 2 ) /2µ) R) N C 2 ) /2µ). 4
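The two-sided bound above is easiest to see in a commuting diagonal toy model, where it holds with C = 1. The following sketch (an illustrative assumption, not the general non-commuting setting of this subsection) takes L = diag(λ_j) with λ_j = j and K = L^(−a):

```python
import numpy as np

# Commuting diagonal toy model: L = diag(lambda_j), lambda_j = j, and a
# hypothetical a-times smoothing operator K = L**(-a). In Hilbert-scale
# coordinates g_j = lambda_j**(mu - a) * f_j, the map K_{mu-a}: H_{mu-a} -> H
# is diagonal with entries lambda_j**(-mu), so here sigma_j = lambda_j**(-mu)
# and the two-sided singular value bound holds with C = 1.
a, mu, n = 1.0, 1.5, 50
lam = np.arange(1, n + 1, dtype=float)

sigma = lam ** (-a) * lam ** (-(mu - a))
assert np.allclose(sigma, lam ** (-mu))

# R(alpha) = #{ sigma_j^2 >= alpha } behaves like alpha**(-d/(2*mu)), d = 1:
alpha = 1e-4
R = int(np.sum(sigma ** 2 >= alpha))
print(R, round(alpha ** (-1.0 / (2 * mu))))  # 21 22
```

The counting function R(α) computed here matches the power-law prediction α^(−d/(2µ)) up to the discretization, in line with the next paragraph.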

If the counting function has the asymptotic behavior N(λ) ≍ λ^d for some d > 0, then R(α) ≍ α^{−d/(2µ)}. For the case L = (I − Δ)^{1/2}, d is the space dimension (see (5.2)). A straightforward computation shows that S(α) := α^{−d/(2µ)} satisfies (4.8) and (4.12) in Assumptions 2 and 3 if and only if d/(2µ) ∈ (0, 1). Under this condition, it follows from Remark 4 that Theorems 6 and 8 hold true with different constants.

It remains to discuss the Hölder-type source conditions (3.5) in this setting. To do this we assume for simplicity that H₁ = H₂ = H. Let K* denote the adjoint of K with respect to the inner product in H. It is easy to show that K* : H_µ → H_{µ+a} is bounded and boundedly invertible for all µ ∈ [µ̲, µ̄]. Let l ∈ N be such that [−2al, 2al] ⊂ [µ̲, µ̄]. Then there exists a constant γ ≥ 1 such that

    γ⁻¹ ‖L^{2al} f‖_H ≤ ‖(K*K)^{−l} f‖_H ≤ γ ‖L^{2al} f‖_H   for all f ∈ H_{2al}.

It follows from the Heinz inequality (see [3], Heinz [2]) that

    γ^{−σ} ‖L^{2aσl} f‖_H ≤ ‖(K*K)^{−σl} f‖_H ≤ γ^{σ} ‖L^{2aσl} f‖_H   for all σ ∈ [0, 1] and f ∈ H_{2aσl}.

Therefore, the source condition f = (K*K)^ν w, w ∈ H, is equivalent to f ∈ H_{2aν}. Let u := 2aν and f ∈ H_u. Then

    E‖f̂_{α,σ} − f‖²_H = O( σ^{2u/(u+a+d/2)} ),   σ → 0,

for the choice α ~ σ^{2a/(u+a+d/2)}, provided τ = O( σ^{(u+a)/(u+a+d/2)} ) and µ̄ ≥ u/(2a).

5.4. L²-Boosting. Boosting algorithms comprise a large class of iterative procedures which stagewise improve the performance of estimators. They have attracted significant interest in the machine learning context and, more recently, in statistics (see Freund & Schapire [6] or Friedman [7], among many others). One of the main challenges is to provide a proper convergence analysis and proper stopping rules for the iteration depth (see Zhang & Yu [4]). L²-boosting has been introduced in the context of regression analysis by Bühlmann & Yu [6]; boosting algorithms were originally designed for classification and more general learning problems. We consider the inverse regression problem described in Section 2.5 when K is an embedding operator and X₂ is a d-dimensional smooth, compact Riemannian manifold (e.g. a smooth compact subset of R^d). Consider a weak learner of the form

    f̂_{0,n} = (1/n) Σ_{j=1}^n Y_j k(·, X_j),    (5.6)

with a continuous, symmetric kernel k : X₂ × X₂ → R such that the integral operator K : L²(X₂) → L²(X₂) with kernel k is compact and strictly positive definite with eigenvalues κ₀ ≥ κ₁ ≥ ... and satisfies

    ess sup_{x∈X₂} k(x, x) < ∞   and   #{κ_j ≥ α} ≍ α^{−d/(2µ)} as α ↘ 0,    (5.7)

for some µ > 0. Further, let H_µ, µ ∈ R, be the Hilbert scale generated by the operator L := K^{−1/(2µ)} as described in Section 5.3. If we set H₁ := H_µ and H₂ = H := L²(X₂), then H₁ ⊂ H₂, and the adjoint of the embedding operator H₁ → H₂ is given by ϕ ↦ Kϕ, since ⟨ϕ, Kψ⟩_{H₁} = ⟨L^µ ϕ, L^µ Kψ⟩_{L²} = ⟨ϕ, ψ⟩_{L²} for all ψ ∈ L²(X₂) and all ϕ ∈ H₁. By a similar reasoning one can show that H₁ is a reproducing kernel Hilbert space (RKHS) with reproducing kernel k(·, x). A typical example of a weak learner is a spline smoother, which leads to Sobolev spaces H_µ (see [6]). Note that the weak learner (5.6) can shortly be written as f̂_{0,n} = K*Y. Boosting this learner results in the recursive iteration

    f̂_{j+1,n} = f̂_{j,n} + β K*(Y − K f̂_{j,n}),   j = 0, 1, 2, ...,    (5.8)

which is in fact the Landweber iteration (see Section 3). Hence, Theorem 6 gives the following bound.
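The recursion (5.8) can be sketched in a few lines on synthetic data; the kernel matrix, the noise level, and the test function below are illustrative assumptions, not the paper's setting:

```python
import numpy as np

# Synthetic sketch of the boosting recursion (5.8), i.e. the Landweber
# iteration f_{j+1} = f_j + beta * K^T (Y - K f_j). The kernel matrix and
# the data below are illustrative assumptions only.
rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02) / n  # "weak learner"
f_true = np.sin(2 * np.pi * x)
Y = K @ f_true + 0.01 * rng.standard_normal(n)

beta = 1.0 / np.linalg.norm(K, 2) ** 2   # step size beta <= 1/||K||^2
f = np.zeros(n)
res = []
for j in range(500):
    f = f + beta * K.T @ (Y - K @ f)     # one boosting / Landweber step
    res.append(np.linalg.norm(Y - K @ f))

# For this step size the residual norms are non-increasing
assert all(r2 <= r1 + 1e-9 for r1, r2 in zip(res, res[1:]))
```

The stopping index j plays the role of the regularization parameter, α = (j + 1)⁻¹: iterating too long inflates the variance, which is exactly the trade-off quantified next.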

Corollary 10. Assume that k satisfies (5.7) with µ > d/2, let g ∈ H_s with s > 0, and let β ∈ (0, ‖K‖⁻²]. Then the MISE is bounded by

    E‖f̂_{j,n} − g‖²_{L²(X₂)} ≤ C ( (j + 1)^{−s/µ} + n⁻¹ (j + 1)^{d/(2µ)} ).    (5.9)

For the optimal stopping index j*(n) ≍ n^{2µ/(2s+d)} we obtain the rate E‖f̂_{j*(n),n} − g‖²_{L²(X₂)} ≤ C n^{−2s/(2s+d)}, which is the well-known minimax rate in the case of Sobolev spaces.

Proof. It follows easily from the definitions that g ∈ H_s is equivalent to g ∈ F_{Λ,w} with Λ(t) = t^{s/(2µ)} for some w > 0. Since the Landweber iteration has infinite qualification (see [3]), Λ satisfies (3.9). Moreover, as the singular values of K are σ_j(K) = √κ_j, (5.7) implies that R(α) = #{σ_j(K)² ≥ α} ≍ S(α) with S(α) := α^{−d/(2µ)}, and S satisfies (4.8) in Assumption 2 for µ > d/2. To verify the assumptions of Prop. 4, we note that tr(K*K) = ∫ β dR(β) < ∞ for µ > d/2 (i.e. K is Hilbert–Schmidt) and that ess sup_{x∈X₂} ‖k(x, ·)‖²_{H₁} = ess sup_{x∈X₂} ⟨k(x, ·), k(x, ·)⟩_{H₁} = ess sup_{x∈X₂} k(x, x) < ∞. Therefore, Prop. 4 and Remark 5 imply that Assumption 1 is satisfied with σ ~ n^{−1/2}. Hence (5.9) follows from (4.10) in Theorem 6 with α = (j + 1)⁻¹ and Remark 4.

Corollary 10 immediately applies to all other regularization methods covered by Theorem 6. In particular, ν-methods require only the square root of the number of Landweber iterations to achieve the optimal rate, but they seem to be unknown in statistics and machine learning. Often a discretized sample variant of the iteration (5.8) is considered. Convergence of this algorithm has been analyzed by Yao et al. [4], but without optimal rates. It is still an open problem whether this discretized version achieves the minimax rates of Corollary 10 in the general context of RKHS, as has been shown in [6] for the particular case of spline learning.

5.5. Errors in variable problems. We now further discuss the errors-in-variables problem introduced in Section 2. Our aim is to establish rates of convergence of estimators of the density f on R^d as the sample size n tends to infinity. Therefore, with a slight abuse of notation, we will write f̂_{α,n} = f̂_{α,σ(n,g)} in this context. It follows from the definition of σ in Section 2 and the boundedness of ‖Λ(K*K)‖_{2,2} that

    sup_{f ∈ F_{Λ,w}} σ(n, Kf) = sup_{‖v‖ = w} σ(n, K Λ(K*K) v) ≤ w n^{−1/2} ( ‖K Λ(K*K)‖_{2,∞} + ‖K Λ(K*K)‖²_{2,2} )^{1/2},

where the expression in parentheses is finite under the assumptions of Proposition 2. We first consider the two important special cases

    h₁(z | y) = w₁(z) := exp(−π ‖z‖₂²),   h₂(z | y) = w₂(z) := c_d exp(−‖z‖₂),   y, z ∈ R^d,

with the normalization constant c_d := π^{−d/2} Γ(d/2 + 1)/Γ(d + 1), corresponding to an error variable W independent of Y. Here K is a convolution operator, the canonical unitary transformation U in the spectral decomposition is the Fourier transform F defined in Section 2, and the multiplier function is ρ_j = |F w_j|², i.e.

    ρ₁(ξ) = exp(−2π ‖ξ‖₂²)   and   ρ₂(ξ) = (1 + 4π² ‖ξ‖₂²)^{−(d+1)}.

Hence, the corresponding functions R are given by

    R₁(α) = V_d ( (1/(2π)) ln(1/α) )^{d/2},   R₂(α) ~ V_d (2π)^{−d} α^{−d/(2d+2)},   0 < α < 1,

where V_d denotes the volume of the unit ball in R^d. Hence, Assumption 2 is satisfied for R₁ with S₁ = R₁ (cf. (5.5)) and for R₂ with S₂(α) := V_d (2π)^{−d} α^{−d/(2d+2)}. Since the norm of the Sobolev space H^s(R^d) is defined by ‖f‖_{H^s(R^d)} = ( ∫ (1 + ‖ξ‖²)^s |Ff(ξ)|² dξ )^{1/2}, a simple computation shows that in the first case a logarithmic source condition (3.6) is equivalent to f ∈ H^{2p}(R^d), and in the second case a Hölder-type source condition (3.5) is equivalent to f ∈ H^{2(d+1)ν}(R^d). Suppose that f ∈ H^s(R^d). Then we find in the first case, for the choice α ~ n^{−1/2}, the asymptotic rate

    E‖f̂_{α,n} − f‖²_{L²} = O( (ln n)^{−s} ),   n → ∞,

and in the second case the rate

    E‖f̂_{α,n} − f‖²_{L²} = O( n^{−s/(s+3d/2+1)} ),   n → ∞,

for the choice α ~ n^{−(d+1)/(s+3d/2+1)}. This generalizes results of Mair & Ruymgaart [28] for spectral cut-off to arbitrary regularization methods and to the multivariate setting.

We now consider the case that the random variables Y and W are not stochastically independent. We assume that the conditional density h is of the form

    h(x − y | y) = w(x − y) + p(x, y),    (5.10)

where c (1 + ‖ξ‖₂²)^{−a} ≤ |Fw(ξ)|² ≤ c' (1 + ‖ξ‖₂²)^{−a} for some constants a, c, c' > 0, and p is C^∞-smooth and decays exponentially as ‖x‖, ‖y‖ → ∞. Then the convolution operator K₀ with kernel w is bounded and boundedly invertible from H^{µ−a}(R^d) to H^µ(R^d) for all µ ∈ R, and the integral operator P with kernel p is compact from H^{µ−a}(R^d) to H^µ(R^d) for all µ ∈ R. Under our general assumption that K = K₀ + P is injective, it follows from Riesz theory that K : H^{µ−a}(R^d) → H^µ(R^d) has a bounded inverse. Hence, it follows from the arguments of the previous paragraph that a Hölder source condition (3.5) for K is equivalent to f ∈ H^{2aν}(R^d). If we additionally assume periodicity of w and p, with an arbitrary size of the periodicity interval, then it follows from our results on operators in Hilbert scales that Assumptions 1 and 2 are also satisfied.

6. Appendix: Proofs and auxiliary results. This section contains the proofs of our main results on the MISE. First, we require some technical lemmas.

Lemma 11. If Assumption 1 holds true and the family of functions {Φ_α} satisfies (3.8), then

    E‖f̂_{α,σ} − E f̂_{α,σ}‖² ≤ (σ C₃/α)² ∫_0^α β dR(β) + (σ C₂)² ∫_α^∞ β⁻¹ dR(β),    (6.1a)

    E‖K f̂_{α,σ} − E K f̂_{α,σ}‖² ≤ (σ C₃/α)² ∫_0^α β² dR(β) + (σ C₂)² R(α).    (6.1b)

Proof. Recall the bound (4.5) on E‖f̂_{α,σ} − E f̂_{α,σ}‖². We split the integral of the variance on the right hand side of (4.5) over the frequency domain S into the low frequency components {ρ ≥ α} and the high frequency components {ρ < α}. The low frequency components are bounded by

    ∫_{ρ≥α} |Φ_α(ρ)|² ρ dΣ ≤ C₂² ∫_{ρ≥α} ρ⁻¹ dΣ = C₂² ∫_α^∞ β⁻¹ dR(β),

where the latter equality holds by a transformation of the integral on the left hand side to an integral with respect to the image measure Σ ∘ ρ⁻¹ and a subsequent reformulation as the Lebesgue–Stieltjes integral given on the right hand side of the equation. Similarly, the high frequency components of the variance can be estimated by

    ∫_{ρ<α} |Φ_α(ρ)|² ρ dΣ ≤ (C₃/α)² ∫_{ρ<α} ρ dΣ = (C₃/α)² ∫_0^α β dR(β),

using (3.8b). This completes the proof of (6.1a). In analogy to (4.5) we have

    E‖K f̂_{α,σ} − E K f̂_{α,σ}‖² ≤ σ² ∫_S |Φ_α(ρ)|² ρ² dΣ,

and the right hand side of this inequality can be estimated as above to establish the bound (6.1b).

The next lemma shows that, for R = S, the high frequency components of the variance are asymptotically bounded by the low frequency components, and that the relative magnitude of these components is determined by the constant γ_S in (4.8d).

Lemma 12. Assume that S ∈ C²((0, ᾱ]) satisfies (4.8), and define κ := γ_S/(2 − γ_S), i.e. 2κ/(κ + 1) = γ_S. Then

    (1/α²) ∫_0^α β (−dS(β)) ≤ κ ( ∫_α^ᾱ β⁻¹ (−dS(β)) + S(ᾱ)/ᾱ ),   α ∈ (0, ᾱ].    (6.2)
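The low/high frequency split of the variance used in the proof above can be illustrated numerically. The sketch below uses the Tikhonov filter Φ_α(ρ) = 1/(ρ + α), which satisfies the bounds ρ|Φ_α(ρ)| ≤ C₂ and |Φ_α(ρ)| ≤ C₃/α with C₂ = C₃ = 1, together with a model point spectrum ρ_j = j⁻²; both choices are illustrative assumptions.

```python
import numpy as np

# Tikhonov filter Phi_alpha(rho) = 1/(rho + alpha): rho*Phi <= 1 and
# Phi <= 1/alpha, so C2 = C3 = 1. Model point spectrum rho_j = j**(-2).
rho = np.arange(1, 10_001, dtype=float) ** (-2.0)
alpha, C2, C3 = 1e-4, 1.0, 1.0

phi = 1.0 / (rho + alpha)
variance = np.sum(phi ** 2 * rho)   # integrated variance, up to sigma^2

low = rho >= alpha                   # low frequencies (large rho)
high = ~low                          # high frequencies (small rho)
bound = C2 ** 2 * np.sum(1.0 / rho[low]) + (C3 / alpha) ** 2 * np.sum(rho[high])

assert variance <= bound
print(variance <= bound)  # True
```

The inequality holds termwise here: each summand Φ_α(ρ_j)²ρ_j is dominated by ρ_j⁻¹ on the low frequencies and by ρ_j/α² on the high frequencies, mirroring the two estimates in the proof.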


The Dirichlet-to-Neumann operator

The Dirichlet-to-Neumann operator Lecture 8 The Dirichlet-to-Neumann operator The Dirichlet-to-Neumann operator plays an important role in the theory of inverse problems. In fact, from measurements of electrical currents at the surface

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

i=1 α i. Given an m-times continuously

i=1 α i. Given an m-times continuously 1 Fundamentals 1.1 Classification and characteristics Let Ω R d, d N, d 2, be an open set and α = (α 1,, α d ) T N d 0, N 0 := N {0}, a multiindex with α := d i=1 α i. Given an m-times continuously differentiable

More information

Learning gradients: prescriptive models

Learning gradients: prescriptive models Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan

More information

2 Tikhonov Regularization and ERM

2 Tikhonov Regularization and ERM Introduction Here we discusses how a class of regularization methods originally designed to solve ill-posed inverse problems give rise to regularized learning algorithms. These algorithms are kernel methods

More information

Here we used the multiindex notation:

Here we used the multiindex notation: Mathematics Department Stanford University Math 51H Distributions Distributions first arose in solving partial differential equations by duality arguments; a later related benefit was that one could always

More information

The spectral zeta function

The spectral zeta function The spectral zeta function Bernd Ammann June 4, 215 Abstract In this talk we introduce spectral zeta functions. The spectral zeta function of the Laplace-Beltrami operator was already introduced by Minakshisundaram

More information

Necessary conditions for convergence rates of regularizations of optimal control problems

Necessary conditions for convergence rates of regularizations of optimal control problems Necessary conditions for convergence rates of regularizations of optimal control problems Daniel Wachsmuth and Gerd Wachsmuth Johann Radon Institute for Computational and Applied Mathematics RICAM), Austrian

More information

L p -boundedness of the Hilbert transform

L p -boundedness of the Hilbert transform L p -boundedness of the Hilbert transform Kunal Narayan Chaudhury Abstract The Hilbert transform is essentially the only singular operator in one dimension. This undoubtedly makes it one of the the most

More information

56 4 Integration against rough paths

56 4 Integration against rough paths 56 4 Integration against rough paths comes to the definition of a rough integral we typically take W = LV, W ; although other choices can be useful see e.g. remark 4.11. In the context of rough differential

More information

TOEPLITZ OPERATORS. Toeplitz studied infinite matrices with NW-SE diagonals constant. f e C :

TOEPLITZ OPERATORS. Toeplitz studied infinite matrices with NW-SE diagonals constant. f e C : TOEPLITZ OPERATORS EFTON PARK 1. Introduction to Toeplitz Operators Otto Toeplitz lived from 1881-1940 in Goettingen, and it was pretty rough there, so he eventually went to Palestine and eventually contracted

More information

On some weighted fractional porous media equations

On some weighted fractional porous media equations On some weighted fractional porous media equations Gabriele Grillo Politecnico di Milano September 16 th, 2015 Anacapri Joint works with M. Muratori and F. Punzo Gabriele Grillo Weighted Fractional PME

More information

Gaussian Random Fields

Gaussian Random Fields Gaussian Random Fields Mini-Course by Prof. Voijkan Jaksic Vincent Larochelle, Alexandre Tomberg May 9, 009 Review Defnition.. Let, F, P ) be a probability space. Random variables {X,..., X n } are called

More information

9 Radon-Nikodym theorem and conditioning

9 Radon-Nikodym theorem and conditioning Tel Aviv University, 2015 Functions of real variables 93 9 Radon-Nikodym theorem and conditioning 9a Borel-Kolmogorov paradox............. 93 9b Radon-Nikodym theorem.............. 94 9c Conditioning.....................

More information

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space.

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Hilbert Spaces Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Vector Space. Vector space, ν, over the field of complex numbers,

More information

Inverse Scattering with Partial data on Asymptotically Hyperbolic Manifolds

Inverse Scattering with Partial data on Asymptotically Hyperbolic Manifolds Inverse Scattering with Partial data on Asymptotically Hyperbolic Manifolds Raphael Hora UFSC rhora@mtm.ufsc.br 29/04/2014 Raphael Hora (UFSC) Inverse Scattering with Partial data on AH Manifolds 29/04/2014

More information

u xx + u yy = 0. (5.1)

u xx + u yy = 0. (5.1) Chapter 5 Laplace Equation The following equation is called Laplace equation in two independent variables x, y: The non-homogeneous problem u xx + u yy =. (5.1) u xx + u yy = F, (5.) where F is a function

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA)

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA) The circular law Lewis Memorial Lecture / DIMACS minicourse March 19, 2008 Terence Tao (UCLA) 1 Eigenvalue distributions Let M = (a ij ) 1 i n;1 j n be a square matrix. Then one has n (generalised) eigenvalues

More information

10.1. The spectrum of an operator. Lemma If A < 1 then I A is invertible with bounded inverse

10.1. The spectrum of an operator. Lemma If A < 1 then I A is invertible with bounded inverse 10. Spectral theory For operators on finite dimensional vectors spaces, we can often find a basis of eigenvectors (which we use to diagonalize the matrix). If the operator is symmetric, this is always

More information

3 Compact Operators, Generalized Inverse, Best- Approximate Solution

3 Compact Operators, Generalized Inverse, Best- Approximate Solution 3 Compact Operators, Generalized Inverse, Best- Approximate Solution As we have already heard in the lecture a mathematical problem is well - posed in the sense of Hadamard if the following properties

More information

Some lecture notes for Math 6050E: PDEs, Fall 2016

Some lecture notes for Math 6050E: PDEs, Fall 2016 Some lecture notes for Math 65E: PDEs, Fall 216 Tianling Jin December 1, 216 1 Variational methods We discuss an example of the use of variational methods in obtaining existence of solutions. Theorem 1.1.

More information

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway

More information

SELF-ADJOINTNESS OF SCHRÖDINGER-TYPE OPERATORS WITH SINGULAR POTENTIALS ON MANIFOLDS OF BOUNDED GEOMETRY

SELF-ADJOINTNESS OF SCHRÖDINGER-TYPE OPERATORS WITH SINGULAR POTENTIALS ON MANIFOLDS OF BOUNDED GEOMETRY Electronic Journal of Differential Equations, Vol. 2003(2003), No.??, pp. 1 8. ISSN: 1072-6691. URL: http://ejde.math.swt.edu or http://ejde.math.unt.edu ftp ejde.math.swt.edu (login: ftp) SELF-ADJOINTNESS

More information

Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels

Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels Y.C. Hon and R. Schaback April 9, Abstract This paper solves the Laplace equation u = on domains Ω R 3 by meshless collocation

More information

Functional Analysis Exercise Class

Functional Analysis Exercise Class Functional Analysis Exercise Class Week 9 November 13 November Deadline to hand in the homeworks: your exercise class on week 16 November 20 November Exercises (1) Show that if T B(X, Y ) and S B(Y, Z)

More information

A RECONSTRUCTION FORMULA FOR BAND LIMITED FUNCTIONS IN L 2 (R d )

A RECONSTRUCTION FORMULA FOR BAND LIMITED FUNCTIONS IN L 2 (R d ) PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 127, Number 12, Pages 3593 3600 S 0002-9939(99)04938-2 Article electronically published on May 6, 1999 A RECONSTRUCTION FORMULA FOR AND LIMITED FUNCTIONS

More information

Regularization and Inverse Problems

Regularization and Inverse Problems Regularization and Inverse Problems Caroline Sieger Host Institution: Universität Bremen Home Institution: Clemson University August 5, 2009 Caroline Sieger (Bremen and Clemson) Regularization and Inverse

More information

POINTWISE BOUNDS ON QUASIMODES OF SEMICLASSICAL SCHRÖDINGER OPERATORS IN DIMENSION TWO

POINTWISE BOUNDS ON QUASIMODES OF SEMICLASSICAL SCHRÖDINGER OPERATORS IN DIMENSION TWO POINTWISE BOUNDS ON QUASIMODES OF SEMICLASSICAL SCHRÖDINGER OPERATORS IN DIMENSION TWO HART F. SMITH AND MACIEJ ZWORSKI Abstract. We prove optimal pointwise bounds on quasimodes of semiclassical Schrödinger

More information

Spectral theory for compact operators on Banach spaces

Spectral theory for compact operators on Banach spaces 68 Chapter 9 Spectral theory for compact operators on Banach spaces Recall that a subset S of a metric space X is precompact if its closure is compact, or equivalently every sequence contains a Cauchy

More information

11. Spectral theory For operators on finite dimensional vectors spaces, we can often find a basis of eigenvectors (which we use to diagonalize the

11. Spectral theory For operators on finite dimensional vectors spaces, we can often find a basis of eigenvectors (which we use to diagonalize the 11. Spectral theory For operators on finite dimensional vectors spaces, we can often find a basis of eigenvectors (which we use to diagonalize the matrix). If the operator is symmetric, this is always

More information

COMPOSITION OPERATORS ON HARDY-SOBOLEV SPACES

COMPOSITION OPERATORS ON HARDY-SOBOLEV SPACES Indian J. Pure Appl. Math., 46(3): 55-67, June 015 c Indian National Science Academy DOI: 10.1007/s136-015-0115-x COMPOSITION OPERATORS ON HARDY-SOBOLEV SPACES Li He, Guang Fu Cao 1 and Zhong Hua He Department

More information

Spectral Continuity Properties of Graph Laplacians

Spectral Continuity Properties of Graph Laplacians Spectral Continuity Properties of Graph Laplacians David Jekel May 24, 2017 Overview Spectral invariants of the graph Laplacian depend continuously on the graph. We consider triples (G, x, T ), where G

More information

4 Hilbert spaces. The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan

4 Hilbert spaces. The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan Wir müssen wissen, wir werden wissen. David Hilbert We now continue to study a special class of Banach spaces,

More information

Gaussian Processes. 1. Basic Notions

Gaussian Processes. 1. Basic Notions Gaussian Processes 1. Basic Notions Let T be a set, and X : {X } T a stochastic process, defined on a suitable probability space (Ω P), that is indexed by T. Definition 1.1. We say that X is a Gaussian

More information

9 Brownian Motion: Construction

9 Brownian Motion: Construction 9 Brownian Motion: Construction 9.1 Definition and Heuristics The central limit theorem states that the standard Gaussian distribution arises as the weak limit of the rescaled partial sums S n / p n of

More information

285K Homework #1. Sangchul Lee. April 28, 2017

285K Homework #1. Sangchul Lee. April 28, 2017 285K Homework #1 Sangchul Lee April 28, 2017 Problem 1. Suppose that X is a Banach space with respect to two norms: 1 and 2. Prove that if there is c (0, such that x 1 c x 2 for each x X, then there is

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

A COUNTEREXAMPLE TO AN ENDPOINT BILINEAR STRICHARTZ INEQUALITY TERENCE TAO. t L x (R R2 ) f L 2 x (R2 )

A COUNTEREXAMPLE TO AN ENDPOINT BILINEAR STRICHARTZ INEQUALITY TERENCE TAO. t L x (R R2 ) f L 2 x (R2 ) Electronic Journal of Differential Equations, Vol. 2006(2006), No. 5, pp. 6. ISSN: 072-669. URL: http://ejde.math.txstate.edu or http://ejde.math.unt.edu ftp ejde.math.txstate.edu (login: ftp) A COUNTEREXAMPLE

More information

ON PARABOLIC HARNACK INEQUALITY

ON PARABOLIC HARNACK INEQUALITY ON PARABOLIC HARNACK INEQUALITY JIAXIN HU Abstract. We show that the parabolic Harnack inequality is equivalent to the near-diagonal lower bound of the Dirichlet heat kernel on any ball in a metric measure-energy

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions Chapter 3 Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions 3.1 Scattered Data Interpolation with Polynomial Precision Sometimes the assumption on the

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

NOTES ON FRAMES. Damir Bakić University of Zagreb. June 6, 2017

NOTES ON FRAMES. Damir Bakić University of Zagreb. June 6, 2017 NOTES ON FRAMES Damir Bakić University of Zagreb June 6, 017 Contents 1 Unconditional convergence, Riesz bases, and Bessel sequences 1 1.1 Unconditional convergence of series in Banach spaces...............

More information

In this chapter we study elliptical PDEs. That is, PDEs of the form. 2 u = lots,

In this chapter we study elliptical PDEs. That is, PDEs of the form. 2 u = lots, Chapter 8 Elliptic PDEs In this chapter we study elliptical PDEs. That is, PDEs of the form 2 u = lots, where lots means lower-order terms (u x, u y,..., u, f). Here are some ways to think about the physical

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Analysis Comprehensive Exam Questions Fall 2008

Analysis Comprehensive Exam Questions Fall 2008 Analysis Comprehensive xam Questions Fall 28. (a) Let R be measurable with finite Lebesgue measure. Suppose that {f n } n N is a bounded sequence in L 2 () and there exists a function f such that f n (x)

More information

Takens embedding theorem for infinite-dimensional dynamical systems

Takens embedding theorem for infinite-dimensional dynamical systems Takens embedding theorem for infinite-dimensional dynamical systems James C. Robinson Mathematics Institute, University of Warwick, Coventry, CV4 7AL, U.K. E-mail: jcr@maths.warwick.ac.uk Abstract. Takens

More information

Regularization via Spectral Filtering

Regularization via Spectral Filtering Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

Regularizations of Singular Integral Operators (joint work with C. Liaw)

Regularizations of Singular Integral Operators (joint work with C. Liaw) 1 Outline Regularizations of Singular Integral Operators (joint work with C. Liaw) Sergei Treil Department of Mathematics Brown University April 4, 2014 2 Outline 1 Examples of Calderón Zygmund operators

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction SHARP BOUNDARY TRACE INEQUALITIES GILES AUCHMUTY Abstract. This paper describes sharp inequalities for the trace of Sobolev functions on the boundary of a bounded region R N. The inequalities bound (semi-)norms

More information

ON MATRIX VALUED SQUARE INTEGRABLE POSITIVE DEFINITE FUNCTIONS

ON MATRIX VALUED SQUARE INTEGRABLE POSITIVE DEFINITE FUNCTIONS 1 2 3 ON MATRIX VALUED SQUARE INTERABLE POSITIVE DEFINITE FUNCTIONS HONYU HE Abstract. In this paper, we study matrix valued positive definite functions on a unimodular group. We generalize two important

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information