ORACLE INEQUALITY FOR A STATISTICAL RAUS GFRERER TYPE RULE
QINIAN JIN AND PETER MATHÉ

Abstract. We consider statistical linear inverse problems in Hilbert spaces. Approximate solutions are sought within a class of one-parameter linear regularization schemes in which the choice of the parameter is crucial for controlling the root mean squared error. A variant of the Raus–Gfrerer rule is proposed and analyzed. It is shown that this parameter choice gives rise to error bounds in terms of oracle inequalities, which in turn provide order optimal error bounds (up to logarithmic factors). These bounds are established only for solutions that obey certain self-similarity properties. The proof of the main result relies on some auxiliary error analysis for linear inverse problems under general noise assumptions, which might be of interest in its own right.

Key words. statistical inverse problem, Raus–Gfrerer parameter choice, oracle inequality

AMS subject classifications. 47A52, secondary: 65F22, 65C60

1. Introduction. In this paper we introduce a new parameter choice strategy for statistical linear inverse problems in Hilbert spaces. We consider the linear equation

    $y^\delta = Tx + \delta\xi$, (1.1)

where $T : X \to Y$ is a compact linear operator between Hilbert spaces $X$ and $Y$, the parameter $\delta > 0$ denotes the noise level, and $\xi$ stands for the additive noise, to be specified later as white Gaussian noise, which leads to observations $y^\delta$. This is a standard model considered in statistical inverse problems. By using the singular system $\{s_j, u_j, v_j\}$ of $T$ to write $Tx = \sum_j s_j \langle x, u_j\rangle v_j$, $x \in X$, the above model (1.1) is seen to be equivalent to the sequence space model

    $y_j^\delta = x_j + \delta\xi_j$, $j = 1, 2, \dots$,

with observations $y_j^\delta = \langle y^\delta, v_j\rangle/s_j$; the noise term $\delta\xi_j$ is centered Gaussian with variance $\delta^2/s_j^2$. The unknown solution $x$ has coefficients $x_j$ with respect to the basis $u_j$, $j = 1, 2, \dots$ This model is frequently analyzed, and we mention the recent survey [3].
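The passage from the operator equation (1.1) to the sequence space model can be checked in a finite-dimensional discretization. The following sketch is our own illustration (the singular values $s_j = 1/j$ and the variable names are assumptions, not from the paper): it builds $T$ from a chosen singular system and verifies that the rescaled coefficients $\langle y^\delta, v_j\rangle/s_j$ obey the sequence space model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
s = 1.0 / np.arange(1, n + 1)                       # assumed singular values s_j = 1/j
U, _ = np.linalg.qr(rng.standard_normal((n, n)))    # orthonormal basis u_j (columns)
V, _ = np.linalg.qr(rng.standard_normal((n, n)))    # orthonormal basis v_j (columns)
T = V @ np.diag(s) @ U.T                            # T x = sum_j s_j <x, u_j> v_j

x = U @ (1.0 / np.arange(1, n + 1) ** 2)            # solution with coefficients x_j = j^-2
delta = 1e-3
xi = rng.standard_normal(n)                         # discretized white noise
y_delta = T @ x + delta * xi                        # observation y^delta = T x + delta*xi

# sequence-space observations y_j^delta = <y^delta, v_j>/s_j
y_seq = (V.T @ y_delta) / s
x_coeff = U.T @ x                                   # coefficients x_j
xi_seq = (V.T @ xi) / s                             # noise with variance 1/s_j^2 per coordinate

assert np.allclose(y_seq, x_coeff + delta * xi_seq)
```

The assertion confirms the equivalence claimed above: after dividing by $s_j$, the noise in the $j$-th coefficient is amplified to variance $\delta^2/s_j^2$.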
In particular the minimax error is clearly understood if the solution sequence $x_j$, $j = 1, 2, \dots$, belongs to some Sobolev-type ball. In that case a series estimator $\hat x_k(y^\delta) = \sum_{j=1}^k c_j y_j^\delta$ (with appropriately chosen weights $c_j$) is (almost) optimal. The important question is how to choose the truncation level (parameter, model) $k$ based on the given data and the noise level $\delta$. Parameter choice in statistical inverse problems, called model selection in this field, is an important issue, and we refer to [3] for a survey. The study [4] also establishes an oracle inequality in a model similar to the one considered here. These authors base the parameter choice on the principle of unbiased risk estimation. The optimal parameter is then obtained from a minimization procedure for the unbiased estimator of the error over a large range of parameters. Another commonly used parameter choice is the Lepskiĭ principle, established in [15] and studied since then in various contexts. This is a universal approach, known to yield oracle

Mathematical Sciences Institute, Bldg 27, Australian National University, ACT 0200, Australia (qinian.jin@anu.edu.au)
Weierstraß Institute for Applied Analysis and Stochastics, Mohrenstraße 39, 10117 Berlin, Germany (peter.mathe@wias-berlin.de)
bounds in many cases. This approach also covers the classical regularization theory, i.e., the case when the noise $\xi$ in (1.1) is bounded. Here the parameter is chosen by starting with a smallest one (in which case the signal is dominated by the noise) and increasing the parameter until signal can be detected (signal dominates noise). Again, problems with small parameters, which are ill-conditioned, need to be solved. The most prominent parameter choice in classical regularization theory is the discrepancy principle, see e.g. [5]. In contrast to the approaches outlined before, this starts from a large initial parameter and decreases it until the data misfit is of the size of the noise level, thus avoiding the computation of solutions of potentially ill-conditioned problems. Here, for any estimator $\hat x = \hat x(y^\delta)$ this amounts to achieving that $\|T\hat x - y^\delta\| \approx \delta$. Like the Lepskiĭ principle, the discrepancy principle is of universal character, and it is known to yield optimal reconstruction in various contexts. Only recently has the discrepancy principle been analyzed within the statistical context in [2]. As pointed out in [2], several problems are faced when using the data misfit in statistical inverse problems. First, since the white noise $\xi$ is not an element of $Y$, the discrepancy $\|T\hat x - y^\delta\|$ is not well defined. Therefore, for statistical inverse problems, the traditional discrepancy principle cannot be applied directly. In order to make the discrepancy principle applicable to statistical inverse problems, we may consider some preconditioning of the equation (1.1). Often this can be achieved by passing to the symmetrized equation, with $A := T^\ast T$ and $\zeta := T^\ast\xi$,

    $z^\delta = T^\ast y^\delta = Ax + \delta T^\ast\xi = Ax + \delta\zeta$. (1.2)

Thus, if the operator $A$ has finite trace, then the new misfit $\|A\hat x - z^\delta\|$ is almost surely finite, resolving the first problem.
Secondly, to apply the discrepancy principle it is tempting to require that

    $\|A\hat x(z^\delta) - z^\delta\| = C\delta$, (1.3)

which gives the discrepancy principle for the symmetrized equation. However, as was pointed out in [2], this plain use of the discrepancy principle (with a large constant $C$) leads only to suboptimal performance. Instead, the misfit $A\hat x(z^\delta) - z^\delta$ should be weighted, and if this is done accordingly, it can yield optimal rates of reconstruction. To be specific we consider the family of estimators $x_\alpha^\delta = (\alpha I + A)^{-1} T^\ast y^\delta$, $\alpha > 0$, defined by Tikhonov regularization. The authors in [2] studied the modified discrepancy principle

    $\|(\lambda I + A)^{-1/2}(A x_\alpha^\delta - z^\delta)\| \le \delta\, C(\lambda)$, (1.4)

with a correction factor $C(\lambda)$ which accounts for the impact of the weight function on the level of the data misfit. It is shown there that an appropriate choice of $\lambda > 0$ yields order optimal reconstruction in many cases. However, the choice of $\lambda$ requires knowledge of the smoothness of the solution, which makes this discrepancy principle an a priori rule. To overcome this drawback, the varying discrepancy principle

    $\|(\alpha I + A)^{-1/2}(A x_\alpha^\delta - z^\delta)\| \le \delta\, C(\alpha)$, (1.5)
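In a finite-dimensional discretization the Tikhonov family and a weighted misfit of the form (1.4)/(1.5) are straightforward to evaluate. The sketch below is a minimal illustration under our own assumptions (a random dense matrix as $T$; function names are ours); the weight is applied through an eigendecomposition of $A = T^\ast T$.

```python
import numpy as np

def tikhonov_estimate(T, y_delta, alpha):
    # x_alpha^delta = (alpha I + A)^{-1} T* y^delta with A = T* T
    A = T.T @ T
    return np.linalg.solve(alpha * np.eye(A.shape[0]) + A, T.T @ y_delta)

def weighted_misfit(T, y_delta, x_est, alpha):
    # ||(alpha I + A)^{-1/2} (A x - z^delta)|| with z^delta = T* y^delta,
    # evaluated via an eigendecomposition of the symmetric matrix A
    A = T.T @ T
    z_delta = T.T @ y_delta
    w, Q = np.linalg.eigh(A)
    r = Q.T @ (A @ x_est - z_delta)
    return float(np.sqrt(np.sum(r**2 / (alpha + w))))

rng = np.random.default_rng(1)
n = 30
T = rng.standard_normal((n, n)) / n          # toy forward operator (assumption)
x = rng.standard_normal(n)
y_delta = T @ x + 1e-3 * rng.standard_normal(n)

x_a = tikhonov_estimate(T, y_delta, alpha=1e-2)
m = weighted_misfit(T, y_delta, x_a, alpha=1e-2)
assert m >= 0.0 and np.isfinite(m)
```

This is only a sketch of the quantities involved, not the paper's parameter choice itself, which is introduced below.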
was proposed in [16]; it is obtained by imposing $\lambda = \alpha$ in (1.4), which makes the principle an a posteriori one, and thus the weight depends on the parameter under consideration. The main achievement in [16] is that this new principle may yield optimal order reconstruction (up to a logarithmic factor). However, it became transparent that such a result holds only for solutions $x$ which satisfy certain self-similarity properties. This has an intuitive explanation: for large values of $\alpha$, and this is where the discrepancy principle starts, the misfit is dominated by the large singular numbers $s_j$. However, the approximation order is determined by all of the spectrum. The last problem faced when using the discrepancy principle is related to early saturation: the regularization scheme which is used to determine the candidate solutions $x_\alpha^\delta$ must have higher qualification than given by the underlying smoothness in terms of general source conditions. For instance, if we use Tikhonov regularization, whose qualification is known to be 1, see [5], then the varying discrepancy principle gives order optimal reconstruction only for smoothness up to 1/2. Any parameter choice which exhibits early saturation cannot obey oracle type bounds. The early saturation, which is inherent in the discrepancy principle in the classical regularization context, can be overcome by turning to the Raus–Gfrerer rule (RG-rule), proposed independently by Raus [2] and Gfrerer [6] in the classical (deterministic) setting. As Raus and Gfrerer proposed, instead of the weighted discrepancy from (1.5) an additional weight factor should be used, which results in the RG-rule

    $\|(\alpha I + A)^{-1}(A x_\alpha^\delta - z^\delta)\| \le \delta\, C(\alpha)$. (1.6)

This is the starting point for the present study, the application of the RG-rule within the statistical context.
It will be shown that an appropriate use of the RG-rule, in particular a proper specification of the function $C(\alpha)$, will yield order optimal results without the effect of early saturation. Actually, we will propose a statistical version of the RG-rule and establish an oracle inequality. Within the deterministic framework, such an oracle inequality for the corresponding RG-rule can be found in [9]. Here we extend this approach to the statistical setup, provided that the solution obeys some self-similarity. An oracle inequality guarantees that the estimator has a risk of the same order as that of the oracle. The oracle bound in particular implies that Tikhonov regularization can achieve order optimal reconstruction up to order 1.

This paper is organized as follows. We first introduce the context, formulate precisely the statistical RG-rule, and then present the main result with some discussion in Section 2. The proof of the main result relies on some auxiliary results within the classical (deterministic noise) setting given in Section 3, however under general noise assumptions. The results in this context may be of interest in their own right. The proof of the main result is given in Section 4. Finally some numerical simulations are reported in Section 5 to test the performance of our statistical RG-rule.

2. Setup and main result. We will use the same setup as in [2, 16]. However, the parameter choice will be different.

2.1. Assumptions. We start with the description of the noise. We will mimic the notion of white Gaussian noise in the present case. Let $(\Omega, \mathcal F, \mathbb P)$ be a (complete) probability space, and let $\mathbb E$ be the expectation with respect to $\mathbb P$.

Assumption 2.1 (white Gaussian noise). The noise $\xi = (\xi(y),\ y \in Y)$ in (1.1) is a stochastic process, defined on $(\Omega, \mathcal F, \mathbb P)$, with the properties that
1. for each $y \in Y$ the random number $\xi(y) \in L^2(\Omega, \mathcal F, \mathbb P)$ is a centered Gaussian random variable, and
2. for all $y, y' \in Y$ the covariance structure is $\mathbb E[\xi(y)\xi(y')] = \langle y, y'\rangle$.

As a consequence, the mapping $y \mapsto \xi(y)$ is linear, and we shall thus write $\xi(y) = \langle \xi, y\rangle$; we refer to [8] for details. The related Gaussian process $\zeta := T^\ast\xi$ has covariance

    $\mathbb E\left[\langle\zeta, w\rangle\langle\zeta, w'\rangle\right] = \langle w, Aw'\rangle$, $w, w' \in X$,

with the operator $A := T^\ast T$.

Assumption 2.2. The operator $A$ has finite trace, $\mathrm{Tr}[A] < \infty$.

Under Assumption 2.2, Sazonov's theorem, cf. [8], asserts that the element $\zeta := T^\ast\xi$ is a Gaussian random element in $X$ (almost surely). Therefore the equation

    $z^\delta = Ax + \delta\zeta$ (2.1)

is a well-defined linear equation in $X$ (almost surely). This will be our main model from now on. Having chosen an initial guess $\bar x \in X$, we can construct estimators by using a linear regularization scheme

    $x_\alpha^\delta := \bar x - g_\alpha(A)(A\bar x - z^\delta)$ (2.2)

with a family of filter functions $g_\alpha : (0, \|A\|] \to \mathbb R$, $\alpha > 0$. The function $r_\alpha(t) := 1 - t\, g_\alpha(t)$, $0 \le t \le \|A\|$, is called the residual function associated with $g_\alpha$. We assume that $\{g_\alpha\}$ satisfies the following conditions.

Assumption 2.3. The functions $\{g_\alpha\}$ are piecewise continuous in $\alpha$ and admit the following properties:
(i) for each $0 < t \le \|A\|$ there holds $r_\alpha(t) \to 0$ as $\alpha \to 0$;
(ii) $|r_\alpha(t)| \le 1$ for all $\alpha > 0$ and $0 \le t \le \|A\|$;
(iii) $r_\alpha(t) \le r_\beta(t)$ for $\alpha \le \beta$ and $0 \le t \le \|A\|$;
(iv) there is a constant $\gamma_\ast \ge 1$ such that $\sup_{0 < t \le \|A\|} |g_\alpha(t)| \le \gamma_\ast/\alpha$ for all $0 < \alpha < \infty$.

There are many regularization methods whose filter functions satisfy Assumption 2.3. We will discuss some examples in Section 2.3. As a useful corollary of Assumption 2.3, we recall the following fact from [11, Lemma 2.3]: for $\alpha \le \beta$ there holds

    $r_\beta(t) - r_\alpha(t) \le (1 + \gamma_\ast)\, \frac{t}{\alpha + t}\, r_\beta(t)$. (2.3)

Indeed, it follows from (ii) and (iii) that

    $r_\beta(t) - r_\alpha(t) \le (1 - r_\alpha(t))\, r_\beta(t) = t\, g_\alpha(t)\, r_\beta(t)$.

The result now follows from the observation that $(t + \alpha)\, g_\alpha(t) \le 1 + \gamma_\ast$. Recall that the element $\zeta = T^\ast\xi$ is a Gaussian random element in $X$ (almost surely). Therefore, with the estimator $x_\alpha^\delta$ defined by (2.2), we may consider the root mean squared error at a solution instance $x$, given as $\left(\mathbb E\left[\|x - x_\alpha^\delta\|^2\right]\right)^{1/2}$.
(2.4) By introducing the noise-free regularized estimators $x_\alpha := \bar x - g_\alpha(A)(A\bar x - z)$ with $z = Ax$,
the mean squared error admits the bias–variance decomposition

    $\mathbb E\left[\|x - x_\alpha^\delta\|^2\right] = \|x - x_\alpha\|^2 + \delta^2\, \mathrm{Tr}\left[g_\alpha^2(A) A\right]$,

and the noise propagation is determined by the trace term. Since Assumption 2.3 implies that $(\alpha + t)\, g_\alpha^2(t) \le (1 + \gamma_\ast)\gamma_\ast/\alpha$, we have

    $\mathbb E\left[\|x - x_\alpha^\delta\|^2\right] \le \|x - x_\alpha\|^2 + (1 + \gamma_\ast)\gamma_\ast\, \frac{\delta^2}{\alpha}\, \mathrm{Tr}\left[(\alpha I + A)^{-1} A\right]$.

For the further development it is necessary to introduce the following function.

Definition 2.1 (effective dimension). The function

    $N(\alpha) = N_A(\alpha) := \mathrm{Tr}\left[(A + \alpha I)^{-1} A\right]$, $\alpha > 0$, (2.5)

is called the effective dimension of the operator $A$ under white Gaussian noise.

According to Assumption 2.2, this function is well defined; moreover, it can be used to get the following bound on the mean squared error:

    $\mathbb E\left[\|x - x_\alpha^\delta\|^2\right] \le \|x - x_\alpha\|^2 + (1 + \gamma_\ast)\gamma_\ast\, \frac{\delta^2 N(\alpha)}{\alpha}$. (2.6)

For further properties of the effective dimension we refer to [2]. Along with the effective dimension, as in [16] we introduce the decreasing function

    $\varrho_N(\alpha) := 1/\sqrt{\alpha N(\alpha)}$, $\alpha > 0$, (2.7)

and its companion

    $\Theta_{\varrho_N}(\alpha) := \alpha\, \varrho_N(\alpha)$, $\alpha > 0$. (2.8)

The function $\Theta_{\varrho_N}$ is continuous and strictly increasing, hence its inverse is well defined. In terms of $\Theta_{\varrho_N}$ the bound in (2.6) can be rewritten as

    $\mathbb E\left[\|x - x_\alpha^\delta\|\right] \le \|x - x_\alpha\| + \frac{\delta\sqrt{(1 + \gamma_\ast)\gamma_\ast}}{\Theta_{\varrho_N}(\alpha)}$, $\alpha > 0$. (2.9)

This bound nicely exhibits the way in which statistical noise enters the noise propagation: in classical regularization it is propagated as $\delta/\alpha$, whereas here the corresponding bound is $\delta/\Theta_{\varrho_N}(\alpha)$, and hence the whole spectrum of the operator $T$ enters.

2.2. Parameter choice. For the regularization scheme (2.2) it is important to choose the regularization parameter so that the mean squared error becomes as small as possible. As was explained in the introduction, we will consider the weighted discrepancy $\|s_\alpha(A)(A x_\alpha^\delta - z^\delta)\|$ with a suitable weight function $s_\alpha(t)$; we will take

    $s_\alpha(t) = \frac{\alpha}{t + \alpha}$, $t \ge 0$, $\alpha > 0$, (2.10)

which is the residual function from Tikhonov regularization.
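For a diagonal (sequence-space) discretization the quantities of Definition 2.1 and (2.7)–(2.8) are easy to evaluate. The following sketch is our own finite truncation (singular values $s_j = 1/j$ are an assumption): it computes $N(\alpha)$, $\varrho_N(\alpha)$ and $\Theta_{\varrho_N}(\alpha)$ and checks the claimed monotonicity numerically.

```python
import numpy as np

# eigenvalues of A = T*T for assumed singular values s_j = 1/j (finite truncation)
lam = (1.0 / np.arange(1, 2001)) ** 2

def N(alpha):
    # effective dimension N(alpha) = Tr[(A + alpha I)^{-1} A]
    return float(np.sum(lam / (lam + alpha)))

def rho_N(alpha):
    # varrho_N(alpha) = 1 / sqrt(alpha * N(alpha)), decreasing in alpha
    return 1.0 / np.sqrt(alpha * N(alpha))

def Theta(alpha):
    # Theta(alpha) = alpha * varrho_N(alpha) = sqrt(alpha / N(alpha)), increasing
    return alpha * rho_N(alpha)

alphas = np.logspace(-6, 0, 25)
Ns = np.array([N(a) for a in alphas])
rhos = np.array([rho_N(a) for a in alphas])
thetas = np.array([Theta(a) for a in alphas])

assert np.all(np.diff(Ns) < 0)       # N is decreasing
assert np.all(np.diff(rhos) < 0)     # varrho_N is decreasing
assert np.all(np.diff(thetas) > 0)   # Theta is strictly increasing
```

The monotonicity of $\Theta_{\varrho_N}$ is what makes its inverse, used in the rates below, well defined.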
From a statistical point of view, a good parameter is usually chosen such that $\|s_\alpha(A)(A x_\alpha^\delta - z^\delta)\|^2$ is of the same magnitude as its variance, which is

    $\delta^2\, \mathbb E\left[\|s_\alpha(A) r_\alpha(A)\zeta\|^2\right] \le \delta^2\, \mathbb E\left[\|s_\alpha^{1/2}(A)\zeta\|^2\right]$.
A straightforward calculation shows that

    $\mathbb E\left[\|s_\alpha^{1/2}(A)\zeta\|^2\right] = \mathrm{Tr}\left[s_\alpha(A) A\right] = \alpha N(\alpha) = \varrho_N(\alpha)^{-2}$. (2.11)

This suggests that we may choose $\alpha$ such that $\|s_\alpha(A)(A x_\alpha^\delta - z^\delta)\|^2$ is bounded from above by a multiple of $\delta^2/\varrho_N(\alpha)^2$. On the other hand, the bound (2.9) suggests that $\alpha$ should be chosen such that the noise propagation term is under control, i.e., $\delta/\Theta_{\varrho_N}(\alpha)$ should be bounded from above by a constant. Having chosen a constant $0 < q < 1$ we select the parameter from the geometric family

    $\Delta_q := \{\alpha_k :\ \alpha_k = \alpha_0 q^k,\ k = 0, 1, 2, \dots\}$. (2.12)

The above motivation suggests the following parameter choice rule for statistical inverse problems.

Definition 2.2 (statistical RG-rule). Given $\tau > 1$, $\eta > 0$ and $\kappa \ge 0$, let $\alpha_{RG}$ be the largest parameter $\alpha \in \Delta_q$ for which either

    $\|s_\alpha(A)(A x_\alpha^\delta - z^\delta)\| \le \tau(1 + \kappa)\, \frac{\delta}{\varrho_N(\alpha)}$, (2.13)

or

    $\Theta_{\varrho_N}(\alpha) \le \eta(1 + \kappa)\,\delta$. (2.14)

We will call the criteria (2.13) and (2.14) the regular stop and the emergency stop, respectively. Notice that the regular stop in Definition 2.2 can be viewed as the Raus–Gfrerer rule applied to Lavrent'ev type regularization of the symmetrized equation (2.1).

Remark 2.1. As $\alpha \to 0$ the misfit on the left of (2.13) eventually falls below the right hand side, so that stopping will occur. The emergency stop is designed in such a way that stopping does not happen too late due to bad realizations of the noise $\zeta$. Moreover, restricting the set of potential parameters to the grid $\Delta_q$ is for practical reasons; one could likewise search for the largest $\alpha$ for which either (2.13) or (2.14) holds. We mention in passing that searching for the largest value requires us to start with large values and decrease until the criterion is satisfied. This is typical for discrepancy based parameter choice, in contrast to other selection schemes, as for instance the Lepskiĭ principle, which must start from some small value, assuming that the true underlying parameter is larger.
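In the diagonal setting the statistical RG-rule of Definition 2.2 is a simple downward scan over the grid $\Delta_q$. The following is a minimal sketch on our own toy problem (the spectrum, the coefficients, and the sample values of $\tau$, $\eta$, $\kappa$, $q$ are all assumptions), using the Tikhonov filter $g_\alpha(t) = 1/(t + \alpha)$:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = (1.0 / np.arange(1, 501)) ** 2          # eigenvalues of A = T*T, assumed s_j = 1/j
x = 1.0 / np.arange(1, 501) ** 1.5            # coefficients of x - xbar (xbar = 0)
delta = 1e-4
zeta = np.sqrt(lam) * rng.standard_normal(lam.size)   # coordinates of zeta = T* xi
z_delta = lam * x + delta * zeta              # symmetrized data, coordinatewise

def N(a):
    return float(np.sum(lam / (lam + a)))

def rho_N(a):
    return 1.0 / np.sqrt(a * N(a))

def Theta(a):
    return a * rho_N(a)

def rg_parameter(alpha0=1.0, q=0.5, tau=1.5, eta=1.0, kappa=0.0, kmax=60):
    # scan the geometric grid downward; return the largest alpha satisfying
    # the regular stop (2.13) or the emergency stop (2.14)
    for k in range(kmax):
        a = alpha0 * q**k
        x_a = z_delta / (lam + a)             # Tikhonov estimate, coordinatewise
        misfit = np.linalg.norm(a / (lam + a) * (lam * x_a - z_delta))
        if misfit <= tau * (1 + kappa) * delta / rho_N(a):    # regular stop
            return a
        if Theta(a) <= eta * (1 + kappa) * delta:             # emergency stop
            return a
    return alpha0 * q**kmax

a_rg = rg_parameter()
assert 0 < a_rg <= 1.0
```

Since $\Theta_{\varrho_N}(\alpha) \le \sqrt{\alpha}$ here, the emergency stop guarantees termination of the scan even for unfavorable noise realizations.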
2.3. Restricting the solution set. One important observation in the subsequent analysis, in particular in Section 3, will be that the RG-rule as introduced in Section 2.2 may fail for statistical problems (and also for bounded deterministic general noise) if the element $x - \bar x$ has scarce spectral behavior relative to the operator $A$. Therefore we shall need the following restriction on the solution $x$. To describe it we use the spectral resolution $(E_t)_{0 \le t \le \|A\|}$ of the non-negative self-adjoint operator $A$.

Assumption 2.4. There exist $c_1 > 1$, $0 < c_2 < 1$ and $0 < t_0 < \|A\|$ such that

    $\int_0^{\alpha} \mathrm d\|E_t(x - \bar x)\|^2 \le c_1^2 \int_{c_2\alpha}^{\|A\|} r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2$
for all $0 < \alpha \le t_0$.

The inequality in Assumption 2.4 with $c_2 = 1$ was introduced in [19] as a generalization of a restriction on $x - \bar x$ from [13] for (iterated) Tikhonov regularization. Here we relax $c_2$ from $c_2 = 1$ to $0 < c_2 < 1$. This minor change is significant when considering applications to the method of truncated SVD, as can be seen from the following discussion.

Example 2.1. For the method of truncated SVD we have

    $g_\alpha(t) = \begin{cases} 1/t, & t \ge \alpha, \\ 0, & t < \alpha, \end{cases}$  and  $r_\alpha(t) = \begin{cases} 0, & t \ge \alpha, \\ 1, & t < \alpha. \end{cases}$

Thus Assumption 2.4 becomes

    $\int_0^{\alpha} \mathrm d\|E_t(x - \bar x)\|^2 \le c_1^2 \int_{c_2\alpha}^{\alpha} \mathrm d\|E_t(x - \bar x)\|^2$, $0 < \alpha \le t_0$.

We observe that

    $\frac{\int_{c_2\alpha}^{\alpha} \mathrm d\|E_t(x - \bar x)\|^2}{\int_0^{\alpha} \mathrm d\|E_t(x - \bar x)\|^2} = 1 - \frac{\|E_{c_2\alpha}(x - \bar x)\|^2}{\|E_{\alpha}(x - \bar x)\|^2}.$

Therefore, for this scheme, Assumption 2.4 is equivalent to the existence of constants $0 < c_2 < 1$, $0 < \theta < 1$ and $0 < t_0 < \|A\|$ such that

    $\|E_{c_2\alpha}(x - \bar x)\| \le \theta\, \|E_{\alpha}(x - \bar x)\|$, $0 < \alpha \le t_0$. (2.15)

It is worth pointing out that if $c_2 = 1$ then (2.15) fails to hold. The relaxation $0 < c_2 < 1$ makes the truncated SVD method applicable for statistical inverse problems when the parameter is chosen by the statistical RG-rule.

Example 2.2. For the $n$-times iterated Tikhonov regularization we have $r_\alpha(t) = \alpha^n/(t + \alpha)^n$. It is easy to see that $r_\alpha(t) \ge c_3 (\alpha/t)^n$ for $t \ge c_2\alpha$, with $c_3 := (c_2/(1 + c_2))^n$. Therefore, in this case, Assumption 2.4 is equivalent to

    $\int_0^{\alpha} \mathrm d\|E_t(x - \bar x)\|^2 \le c_4 \int_{c_2\alpha}^{\|A\|} \frac{\alpha^{2n}}{t^{2n}}\, \mathrm d\|E_t(x - \bar x)\|^2$, $0 < \alpha \le t_0$.

This, with $c_2 = 1$, is the condition used in [13].

Example 2.3. For asymptotical regularization we have $r_\alpha(t) = e^{-t/\alpha}$. Since $e^{-t/\alpha} \ge e^{-1}$ for $c_2\alpha \le t \le \alpha$, it is easy to see that Assumption 2.4 holds if (2.15) is satisfied.

Example 2.4. For the Landweber iteration with $\|A\| \le 1$ we have $r_\alpha(t) = (1 - t)^{[1/\alpha]}$, where $[1/\alpha]$ denotes the largest integer that is not greater than $1/\alpha$. Observe that for $0 < t \le \alpha \le 1/2$ there holds

    $(1 - t)^{[1/\alpha]} \ge (1 - t)^{1/\alpha} \ge (1 - \alpha)^{1/\alpha} \ge 1/4.$

Therefore Assumption 2.4 holds if (2.15) is satisfied.

We summarize the discussion on Assumption 2.4 as follows.
The restriction as expressed in this assumption concerns the interplay of the spectral resolution of the
operator, the related smoothness of the element $x - \bar x$, and the chosen regularization scheme. As can be seen from the above examples, the chosen regularization has minor impact, and in many cases the assumption reduces to the interplay between the decay rate of the singular numbers and the decay rate of the Fourier coefficients, see e.g. (2.15). This is valid for moderately ill-posed situations, i.e., when both the singular numbers and the smoothness are of power type. It was shown in [16] (under additional assumptions) that the set of elements $x - \bar x$ obeying Assumption 2.4 is everywhere dense in $X$, and it is rare to observe the failure of the RG-rule in practice. However, we recall the heuristics already mentioned in the introduction: for large values of $\alpha$ the misfit is determined by the large singular numbers. Since the overall error depends on the whole spectrum through the effective dimension $N(\alpha)$, a domination condition as expressed in Assumption 2.4 seems unavoidable.

2.4. Main result and discussion. The main result in this paper is as follows.

Theorem 2.1. Let Assumptions 2.1–2.4 hold, and let $\{x_\alpha^\delta\}$ be the estimators defined by the linear regularization scheme (2.2). Let $\alpha_{RG}$ be chosen according to the statistical RG-rule with $\kappa = \left(8\log(1/\delta)/N(\alpha_0)\right)^{1/2}$. Then there is a constant $C$ such that

    $\left(\mathbb E\left[\|x - x_{\alpha_{RG}}^\delta\|^2\right]\right)^{1/2} \le C \inf_{0 < \alpha \le \alpha_0} \left\{\|x - x_\alpha\| + \frac{\delta\left(1 + \sqrt{\log(1/\delta)}\right)}{\Theta_{\varrho_N}(\alpha)}\right\}.$

The oracle inequality as established in Theorem 2.1 allows us to state the error bound which is obtained under a known general source condition and by an a priori parameter choice. We recall some notions.

Definition 2.3 (general source set). Given an index function $\psi$ that is continuous, non-negative, and non-decreasing on $[0, \|A\|]$ with $\psi(0) = 0$, the set

    $H_\psi := \{x \in X :\ x = \psi(A)v \text{ for some } \|v\| \le 1\}$

is called a general source set.

For solutions $x$ which belong to some source set, the bias $x - x_\alpha$ can be bounded under the assumption that the chosen regularization has enough qualification, see e.g. [7].

Definition 2.4 (qualification).
The regularization is said to have qualification $\psi$ if there is a constant $\gamma < \infty$ such that

    $|r_\alpha(t)|\, \psi(t) \le \gamma\, \psi(\alpha)$ for all $\alpha > 0$ and $0 \le t \le \|A\|$.

Notice that $x - x_\alpha = r_\alpha(A)(x - \bar x)$. If the regularization has qualification $\psi$ and $x - \bar x \in H_\psi$, then

    $\|x - x_\alpha\| \le \gamma\, \psi(\alpha)\, \|v\| \le \gamma\, \psi(\alpha).$

By choosing $\alpha_\delta > 0$ to be the root of the equation

    $\left(\Theta_{\varrho_N}\psi\right)(\alpha) := \Theta_{\varrho_N}(\alpha)\,\psi(\alpha) = \delta\left(1 + \sqrt{\log(1/\delta)}\right),$

we can use the oracle inequality in Theorem 2.1 to obtain the following result.

Corollary 2.1. Let the assumptions of Theorem 2.1 hold, and let $\{x_\alpha^\delta\}$ be the estimators defined by the linear regularization scheme (2.2). Let $\alpha_{RG}$ be chosen according to the statistical RG-rule with $\kappa = \left(8\log(1/\delta)/N(\alpha_0)\right)^{1/2}$. If the regularization has qualification $\psi$, then

    $\sup_{x - \bar x \in H_\psi} \left(\mathbb E\left[\|x - x_{\alpha_{RG}}^\delta\|^2\right]\right)^{1/2} \le C\, \psi\!\left(\left(\Theta_{\varrho_N}\psi\right)^{-1}\!\left(\delta\left(1 + \sqrt{\log(1/\delta)}\right)\right)\right).$
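For power-type smoothness the quantity $\alpha_\delta$ in Corollary 2.1 can be computed numerically. The sketch below uses our own toy spectrum $s_j = 1/j$ and the assumed index function $\psi(t) = t^{1/2}$ (both assumptions, not from the paper), and solves $\Theta_{\varrho_N}(\alpha)\psi(\alpha) = \delta(1 + \sqrt{\log(1/\delta)})$ by geometric bisection.

```python
import math
import numpy as np

lam = (1.0 / np.arange(1, 5001)) ** 2        # eigenvalues of A, assumed s_j = 1/j

def Theta(a):
    # Theta_{rho_N}(alpha) = sqrt(alpha / N(alpha)) with N(alpha) = Tr[(A + aI)^{-1} A]
    return math.sqrt(a / float(np.sum(lam / (lam + a))))

def psi(a):
    # assumed index function psi(t) = t^{1/2}
    return math.sqrt(a)

def alpha_delta(delta, lo=1e-16, hi=1.0):
    # bisection for Theta(alpha) * psi(alpha) = delta * (1 + sqrt(log(1/delta)))
    rhs = delta * (1.0 + math.sqrt(math.log(1.0 / delta)))
    f = lambda a: Theta(a) * psi(a) - rhs    # increasing in a, sign change on [lo, hi]
    for _ in range(200):
        mid = math.sqrt(lo * hi)             # geometric midpoint suits the log scale
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi

delta = 1e-6
a_d = alpha_delta(delta)
rate = psi(a_d)                              # resulting error order psi(alpha_delta)
assert 0 < a_d < 1
assert Theta(a_d) * psi(a_d) >= delta        # root approached from above
```

The returned value `rate` illustrates, for this assumed configuration, the order of the right hand side in Corollary 2.1.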
Thus, up to a logarithmic factor, the rate in Corollary 2.1 coincides with the one from [2, Theorem 1], which is known to be order optimal in many cases.

We conclude this section with an outline of the proof of Theorem 2.1. The basic idea is to reduce the argument to the one for bounded deterministic noise. In (2.11) we exhibited the typical size of the noise. Therefore we choose a tuning parameter $\kappa$, as specified in Theorem 2.1, and define the set

    $Z_\kappa := \left\{\zeta :\ \|s_\alpha^{1/2}(A)\zeta\| \le (1 + \kappa)\, \frac{1}{\varrho_N(\alpha)},\ \hat\alpha \le \alpha \le \alpha_0,\ \alpha \in \Delta_q\right\},$ (2.16)

where $\hat\alpha$ is the largest number in $\Delta_q$ satisfying $\Theta_{\varrho_N}(\hat\alpha) \le \eta(1 + \kappa)\,\delta$. The random variable $\zeta$ induces a probability measure on $X$, called $\mathbb P_\zeta[\cdot]$ below. Let $Z_\kappa^c$ denote the complement of $Z_\kappa$ in $X$. Since $X = Z_\kappa \cup Z_\kappa^c$, we can use the Cauchy–Schwarz inequality to derive that

    $\left(\mathbb E\left[\|x - x_\alpha^\delta\|^2\right]\right)^{1/2} \le \sup_{\zeta \in Z_\kappa} \|x - x_\alpha^\delta\| + \left(\mathbb E\left[\|x - x_\alpha^\delta\|^4\right]\right)^{1/4} \left(\mathbb P_\zeta[Z_\kappa^c]\right)^{1/4};$ (2.17)

see [2, Proposition 3]. We will estimate the two terms on the right side of (2.17) with $\alpha = \alpha_{RG}$. Uniformly for $\zeta \in Z_\kappa$ the first term on the right can be considered as an error estimate under bounded deterministic noise; we will show in Section 3 that it can be bounded by the right hand side of the oracle inequality in Theorem 2.1. This analysis may be of independent interest. In Section 4 we will use a concentration inequality for Gaussian elements in Hilbert space to show that the second term on the right in (2.17) is negligible; this is enough for us to complete the proof of Theorem 2.1.

3. Auxiliary results for bounded noise. The situation for bounded deterministic noise which resembles the Gaussian white noise case is regularization under some specifically chosen weighted noise. We recall the function $s_\alpha$ from (2.10). As can be seen from the set $Z_\kappa$ in (2.16), the appropriate setup is as follows.

Assumption 3.1.
There is a function $\delta(\alpha) > 0$ defined on $(0, \infty)$ that is non-decreasing, while $\delta(\alpha)/\sqrt{\alpha}$ is non-increasing, such that the noise $\zeta$ obeys

    $\delta\, \|s_\alpha^{1/2}(A)\zeta\| \le \delta(\alpha)$, $\hat\alpha \le \alpha \le \alpha_0$, $\alpha \in \Delta_q$, (3.1)

where $\hat\alpha \in \Delta_q$ is the largest parameter such that $\hat\alpha \le \eta\,\delta(\hat\alpha)$, with $\eta > 0$ being a given small number.

Because $\delta(\alpha)/\sqrt{\alpha}$ is non-increasing, the quotient $\alpha/\delta(\alpha)$ is non-decreasing, and it is easy to see that $\hat\alpha$ is well defined.

Remark 3.1. The setup in Assumption 3.1 covers a variety of cases which have been subsumed under the notion of general noise assumptions; we refer to [18, 1]. Specifically, let us consider the following situation. Suppose that the noise $\zeta$ allows for a noise bound, for some parameter $\mu$, with

    $\|A^{-\mu}\zeta\| \le 1$. (3.2)

In this case we can bound

    $\delta\, \|s_\alpha^{1/2}(A)\zeta\| \le \delta\, \|s_\alpha^{1/2}(A) A^{\mu}\|\, \|A^{-\mu}\zeta\| \le \|s_\alpha^{1/2}(A) A^{\mu}\|\, \delta.$
It is easily verified that the operator norms $\|s_\alpha^{1/2}(A) A^{\mu}\|$ are uniformly bounded for $\alpha > 0$ if and only if $\mu \le 1/2$. In this range we easily obtain that

    $\|s_\alpha^{1/2}(A) A^{\mu}\| \le \alpha^{\mu}$, $\alpha > 0$.

The two limiting cases are $\mu = 0$, where we assume $\|\zeta\| = \|T^\ast\xi\| \le 1$, which corresponds to large noise, and $\mu = 1/2$, where we assume $\|A^{-1/2}\zeta\| = \|\xi\| \le 1$, which corresponds to the usual noise assumption in linear inverse problems in Hilbert spaces. In any of the cases $0 \le \mu \le 1/2$ we get a bounding function $\delta(\alpha) = \delta\,\alpha^{\mu}$, which obeys the requirements made in Assumption 3.1.

Let $\hat\alpha$ be defined as in Assumption 3.1, i.e., $\hat\alpha \in \Delta_q$ is the largest parameter such that $\hat\alpha \le \eta\,\delta(\hat\alpha)$.

Definition 3.1 (deterministic RG-rule). Given $\tau > 1$ and $\eta > 0$, we define $\alpha_\ast \in \Delta_q$ to be the largest parameter such that $\alpha_\ast \ge \hat\alpha$ and

    $\|s_{\alpha_\ast}(A)(A x_{\alpha_\ast}^\delta - z^\delta)\| \le \tau\,\delta(\alpha_\ast)$; (3.3)

if such an $\alpha_\ast$ does not exist, we define $\alpha_\ast := \hat\alpha$.

We notice that the norm in the above criterion can be rewritten as

    $\|s_\alpha(A)(A x_\alpha^\delta - z^\delta)\| = \|s_\alpha(A) r_\alpha(A)(A\bar x - z^\delta)\|.$

3.1. Properties of the deterministic RG-rule. We give some technical consequences of the stopping criterion which will be used later.

Lemma 3.1. Let $\alpha \in \Delta_q$ be any parameter such that $\alpha > \alpha_\ast$. Then there holds

    $\frac{\delta(\alpha)}{\alpha} \le \frac{1}{\tau - 1}\, \|x_\alpha - x\|.$

Proof. Since $\alpha > \alpha_\ast$, by the definition of $\alpha_\ast$ we must have

    $\tau\,\delta(\alpha) < \|s_\alpha(A) r_\alpha(A)(A\bar x - z^\delta)\|.$

Therefore, it follows from Assumption 3.1 that

    $\tau\,\delta(\alpha) \le \|s_\alpha(A) r_\alpha(A)(z - z^\delta)\| + \|s_\alpha(A) r_\alpha(A)(A\bar x - z)\| \le \|s_\alpha^{1/2}(A) r_\alpha(A)\|\,\delta(\alpha) + \|s_\alpha(A) A\|\, \|x_\alpha - x\|.$

Since $s_\alpha^{1/2}(t)\,|r_\alpha(t)| \le 1$ and $s_\alpha(t)\,t \le \alpha$, we have $\|s_\alpha^{1/2}(A) r_\alpha(A)\| \le 1$ and $\|s_\alpha(A) A\| \le \alpha$. Consequently $(\tau - 1)\,\delta(\alpha) \le \alpha\,\|x_\alpha - x\|$, which gives the estimate.

Lemma 3.2. Let the parameter $\alpha_\ast$ be chosen by the RG-rule in Definition 3.1. Then

    $\|s_{\alpha_\ast}(A) r_{\alpha_\ast}(A)(A\bar x - z)\| \le \gamma_1\,\delta(\alpha_\ast),$

where $\gamma_1 := \max\{1 + \tau,\ \eta\,\|x - \bar x\|\}$.
Proof. If $\alpha_\ast = \hat\alpha$, then it follows from the definition of $\hat\alpha$ that $\alpha_\ast \le \eta\,\delta(\alpha_\ast)$. Consequently

    $\|s_{\alpha_\ast}(A) r_{\alpha_\ast}(A)(A\bar x - z)\| = \|s_{\alpha_\ast}(A) r_{\alpha_\ast}(A) A(\bar x - x)\| \le \alpha_\ast\,\|\bar x - x\| \le \eta\,\|\bar x - x\|\,\delta(\alpha_\ast).$

Otherwise we have $\alpha_\ast > \hat\alpha$. Then by the definition of $\alpha_\ast$ we have

    $\|s_{\alpha_\ast}(A) r_{\alpha_\ast}(A)(A\bar x - z)\| \le \|s_{\alpha_\ast}(A) r_{\alpha_\ast}(A)(z - z^\delta)\| + \|s_{\alpha_\ast}(A) r_{\alpha_\ast}(A)(A\bar x - z^\delta)\| \le \|s_{\alpha_\ast}^{1/2}(A) r_{\alpha_\ast}(A)\|\,\delta(\alpha_\ast) + \tau\,\delta(\alpha_\ast) \le (1 + \tau)\,\delta(\alpha_\ast),$

and the proof is complete.

3.2. Auxiliary inequalities: the impact of Assumption 2.4. The following inequalities may be of general interest. The first one goes back to [1, 12]; see also [11, Lemma 2.4].

Lemma 3.3. For $\alpha \le \beta$ we have

    $\|x_\beta - x_\alpha\| \le \frac{1 + \gamma_\ast}{\sqrt{\alpha}}\, \|A^{1/2} s_\beta^{1/2}(A) r_\beta(A)(x - \bar x)\|.$

Proof. We first notice that $x_\beta - x_\alpha = (r_\beta(A) - r_\alpha(A))(\bar x - x)$. The bound established in (2.3) yields that

    $\|x_\beta - x_\alpha\| \le (1 + \gamma_\ast)\, \|A(\alpha I + A)^{-1} r_\beta(A)(x - \bar x)\| = \frac{1 + \gamma_\ast}{\alpha}\, \|A s_\alpha(A) r_\beta(A)(x - \bar x)\|.$

We may write

    $A s_\alpha(A) = A^{1/2} s_\alpha^{1/2}(A)\; s_\alpha^{1/2}(A) s_\beta^{-1/2}(A)\; s_\beta^{1/2}(A) A^{1/2}.$

Observing that $s_\alpha(t)\,t \le \alpha$ and $s_\alpha(t) \le s_\beta(t)$ for $\alpha \le \beta$, we have $\|s_\alpha^{1/2}(A) A^{1/2}\| \le \sqrt{\alpha}$ and $\|s_\alpha^{1/2}(A) s_\beta^{-1/2}(A)\| \le 1$. Therefore

    $\|A s_\alpha(A) r_\beta(A)(x - \bar x)\| \le \sqrt{\alpha}\, \|A^{1/2} s_\beta^{1/2}(A) r_\beta(A)(x - \bar x)\|,$

which allows us to complete the proof.

The bound from Lemma 3.3 does not suffice, and we need the following strengthening, where Assumption 2.4 is crucial.

Lemma 3.4. Suppose that Assumption 2.4 holds true. Then there is a constant $C < \infty$ such that for $0 < \alpha \le \alpha_0$ there holds

    $\|A^{1/2} s_\alpha^{1/2}(A) r_\alpha(A)(x - \bar x)\| \le \frac{C}{\sqrt{\alpha}}\, \|s_\alpha(A) r_\alpha(A) A(x - \bar x)\|.$

Proof. We use spectral calculus to write

    $\|A^{1/2} s_\alpha^{1/2}(A) r_\alpha(A)(x - \bar x)\|^2 = I_1(\alpha) + I_2(\alpha),$
where

    $I_1(\alpha) := \int_0^{\alpha} t\, s_\alpha(t)\, r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2$,  $I_2(\alpha) := \int_{\alpha}^{\|A\|} t\, s_\alpha(t)\, r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2$.

We first bound $I_2$. For $t \ge \alpha$ we have $(t + \alpha) \le 2t$, thus $\frac{\alpha}{2} \le t\, s_\alpha(t)$, yielding

    $I_2(\alpha) \le \frac{2}{\alpha} \int_{\alpha}^{\|A\|} t^2 s_\alpha^2(t)\, r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2 \le \frac{2}{\alpha}\, \|s_\alpha(A) r_\alpha(A) A(x - \bar x)\|^2.$

To estimate $I_1(\alpha)$ we will use Assumption 2.4. We consider two cases, $\alpha \le t_0$ and $t_0 < \alpha$. When $\alpha \le t_0$, we use Assumption 2.4 together with $t\, s_\alpha(t) \le \alpha$ to obtain

    $I_1(\alpha) \le \alpha \int_0^{\alpha} \mathrm d\|E_t(x - \bar x)\|^2 \le \alpha\, c_1^2 \int_{c_2\alpha}^{\|A\|} r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2.$

Since $t/(t + \alpha) \ge c_2/(1 + c_2)$ for $t \ge c_2\alpha$, we have $t\, s_\alpha(t) \ge \frac{c_2}{1 + c_2}\,\alpha$ there, and we further obtain

    $I_1(\alpha) \le \frac{c_1^2 (1 + c_2)^2}{c_2^2}\, \frac{1}{\alpha} \int_{c_2\alpha}^{\|A\|} s_\alpha^2(t)\, r_\alpha^2(t)\, t^2\, \mathrm d\|E_t(x - \bar x)\|^2 \le \frac{c_1^2 (1 + c_2)^2}{c_2^2}\, \frac{1}{\alpha}\, \|s_\alpha(A) r_\alpha(A) A(x - \bar x)\|^2.$

Now we consider the case $t_0 < \alpha$. We write $I_1(\alpha) = I_1^{(1)}(\alpha) + I_1^{(2)}(\alpha)$, where

    $I_1^{(1)}(\alpha) := \int_0^{t_0} t\, s_\alpha(t)\, r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2$,  $I_1^{(2)}(\alpha) := \int_{t_0}^{\alpha} t\, s_\alpha(t)\, r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2$.

We can bound, by using Assumption 2.4 at $t_0$, the term $I_1^{(1)}(\alpha)$ as

    $I_1^{(1)}(\alpha) \le \alpha\, c_1^2 \int_{c_2 t_0}^{\|A\|} r_{t_0}^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2.$

Since $t_0 \le \alpha$ implies $r_{t_0}(t) \le r_\alpha(t)$, we have

    $I_1^{(1)}(\alpha) \le \alpha\, c_1^2 \int_{c_2 t_0}^{\|A\|} r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2.$

Observing that for $t \ge c_2 t_0$ there holds $\frac{t}{t + \alpha} \ge \frac{c_2 t_0}{c_2 t_0 + \alpha}$, we further obtain

    $I_1^{(1)}(\alpha) \le c_1^2 \left(\frac{c_2 t_0 + \alpha}{c_2 t_0}\right)^2 \frac{1}{\alpha} \int_{c_2 t_0}^{\|A\|} s_\alpha^2(t)\, r_\alpha^2(t)\, t^2\, \mathrm d\|E_t(x - \bar x)\|^2 \le c_1^2 \left(\frac{c_2 t_0 + \alpha}{c_2 t_0}\right)^2 \frac{1}{\alpha}\, \|s_\alpha(A) r_\alpha(A) A(x - \bar x)\|^2.$
To bound $I_1^{(2)}$, we observe that for $t \ge t_0$ there holds $t\, s_\alpha(t) \ge \frac{t_0}{t_0 + \alpha}\,\alpha$. Consequently

    $I_1^{(2)}(\alpha) \le \frac{t_0 + \alpha}{t_0\,\alpha} \int_{t_0}^{\alpha} t^2 s_\alpha^2(t)\, r_\alpha^2(t)\, \mathrm d\|E_t(x - \bar x)\|^2 \le \frac{t_0 + \alpha}{t_0\,\alpha}\, \|s_\alpha(A) r_\alpha(A) A(x - \bar x)\|^2.$

Combining the above estimates we obtain the desired bound with a constant $C$ which depends only on $c_1$, $c_2$, $t_0$, and $\|A\|$.

We summarize the results from Lemma 3.3 and Lemma 3.4 as follows.

Corollary 3.1. Let Assumption 2.4 hold. Then there is a constant $C < \infty$ such that for all $\alpha \le \beta$ there holds

    $\|x_\beta - x_\alpha\| \le \frac{C}{\sqrt{\alpha\beta}}\, \|s_\beta(A) r_\beta(A) A(x - \bar x)\|.$

3.3. Deterministic oracle inequality. In this section we state the main auxiliary result for bounded deterministic noise, as this seems to be of independent interest.

Theorem 3.1. Let Assumptions 2.4 and 3.1 hold, and let the parameter $\alpha_\ast$ be chosen by the RG-rule starting at $\alpha_0$. Then there holds an oracle inequality, i.e., there is a constant $C$ such that

    $\|x_{\alpha_\ast}^\delta - x\| \le C \inf_{0 < \alpha \le \alpha_0} \left\{\|x_\alpha - x\| + \frac{\delta(\alpha)}{\alpha}\right\}.$ (3.4)

Proof. We first derive some preparatory results. Observing that $x - x_\alpha = r_\alpha(A)(x - \bar x)$, we have from Assumption 2.3 that

    $\|x - x_\alpha\| \le \|x - x_\beta\|$, $\alpha \le \beta$. (3.5)

By the conditions on $g_\alpha$ we have

    $|g_\alpha(t)|\, s_\alpha^{-1/2}(t) = \frac{1}{\sqrt{\alpha}}\,\sqrt{|g_\alpha(t)|}\;\sqrt{|g_\alpha(t)|\,(\alpha + t)} \le \frac{\sqrt{\gamma_\ast(1 + \gamma_\ast)}}{\alpha}.$

Therefore, with $c_0 = \sqrt{\gamma_\ast(1 + \gamma_\ast)}$ we have

    $\|x - x_\alpha^\delta\| \le \|x - x_\alpha\| + \left\|g_\alpha(A) s_\alpha^{-1/2}(A)\right\|\, \left\|s_\alpha^{1/2}(A)(z - z^\delta)\right\| \le \|x - x_\alpha\| + \frac{c_0}{\alpha}\,\delta\, \|s_\alpha^{1/2}(A)\zeta\|.$ (3.6)

It then follows from Assumption 3.1 that

    $\|x - x_\alpha^\delta\| \le \|x - x_\alpha\| + c_0\, \frac{\delta(\alpha)}{\alpha}$, $\hat\alpha \le \alpha \le \alpha_0$, $\alpha \in \Delta_q$. (3.7)

Next we will prove the oracle inequality in two steps. We first restrict the oracle bound to $\alpha \in \Delta_q$, and we show that

    $\|x_{\alpha_\ast}^\delta - x\| \le C \inf_{\alpha \in \Delta_q} \left\{\|x_\alpha - x\| + \frac{\delta(\alpha)}{\alpha}\right\}.$ (3.8)

In this case we shall distinguish the cases $\alpha > \alpha_\ast$ and $\alpha \le \alpha_\ast$, respectively.
Case $\alpha > \alpha_\ast$. Since $\alpha, \alpha_\ast \in \Delta_q$ we have $\alpha \ge \alpha_\ast/q$ and $\alpha_\ast/q > \alpha_\ast$. From (3.7), the monotonicity of $\delta(\cdot)$, and Lemma 3.1 applied at $\alpha_\ast/q$, we first have

    $\|x_{\alpha_\ast}^\delta - x\| \le \|x_{\alpha_\ast} - x\| + c_0\, \frac{\delta(\alpha_\ast)}{\alpha_\ast} \le \|x_{\alpha_\ast} - x\| + \frac{c_0}{q}\, \frac{\delta(\alpha_\ast/q)}{\alpha_\ast/q} \le \|x_{\alpha_\ast} - x\| + \frac{c_0}{q(\tau - 1)}\, \|x_{\alpha_\ast/q} - x\|.$

Then we can conclude, by using (3.5) twice, that

    $\|x_{\alpha_\ast}^\delta - x\| \le \left(1 + \frac{c_0}{q(\tau - 1)}\right) \|x_\alpha - x\|.$

Case $\alpha \le \alpha_\ast$. Here we actually use Assumption 2.4 and its consequences. Based on Corollary 3.1 and Lemma 3.2 we conclude that there is a constant $C < \infty$ with

    $\|x_{\alpha_\ast} - x\| \le \|x_\alpha - x\| + \|x_{\alpha_\ast} - x_\alpha\| \le \|x_\alpha - x\| + \frac{C}{\sqrt{\alpha\,\alpha_\ast}}\, \|s_{\alpha_\ast}(A) r_{\alpha_\ast}(A)(A\bar x - z)\| \le \|x_\alpha - x\| + C\gamma_1\, \frac{\delta(\alpha_\ast)}{\sqrt{\alpha\,\alpha_\ast}}.$

Consequently we deduce, using the bound (3.7) and the fact that $\delta(\alpha)/\sqrt{\alpha}$ is non-increasing, that

    $\|x_{\alpha_\ast}^\delta - x\| \le \|x_{\alpha_\ast} - x\| + c_0\, \frac{\delta(\alpha_\ast)}{\alpha_\ast} \le \|x_\alpha - x\| + C\gamma_1\, \frac{\delta(\alpha)}{\alpha} + c_0\, \frac{\delta(\alpha)}{\alpha} \le (C\gamma_1 + c_0)\left(\|x_\alpha - x\| + \frac{\delta(\alpha)}{\alpha}\right).$

This proves (3.8). Finally, we show the oracle inequality in its full generality. To this end, let $0 < \alpha \le \alpha_0$ be any number. Then there is $j$ such that $\alpha_j < \alpha \le \alpha_j/q$. By using (3.5), the fact that $\delta(\cdot)$ is non-decreasing, and $\alpha/\alpha_j \le 1/q$, we obtain

    $\|x_{\alpha_j} - x\| + \frac{\delta(\alpha_j)}{\alpha_j} \le \|x_\alpha - x\| + \frac{\delta(\alpha)}{\alpha}\cdot\frac{\alpha}{\alpha_j} \le \frac{1}{q}\left(\|x_\alpha - x\| + \frac{\delta(\alpha)}{\alpha}\right).$

Since $0 < \alpha \le \alpha_0$ was arbitrary, we obtain

    $\inf_{\alpha \in \Delta_q} \left\{\|x_\alpha - x\| + \frac{\delta(\alpha)}{\alpha}\right\} \le \frac{1}{q}\, \inf_{0 < \alpha \le \alpha_0} \left\{\|x_\alpha - x\| + \frac{\delta(\alpha)}{\alpha}\right\}.$

The proof is therefore complete.

3.4. Discussion. (a) From Lemma 3.1 and (3.5) it follows that

    $\frac{\delta(\alpha_\ast/q)}{\alpha_\ast/q} \le \frac{1}{\tau - 1}\, \|x_{\alpha_\ast/q} - x\| \le \frac{1}{\tau - 1}\, \|x_{\alpha_0} - x\|.$

Since $\delta(\cdot)$ is non-decreasing, we obtain

    $\frac{\delta(\alpha_\ast)}{\alpha_\ast} \le \frac{1}{q(\tau - 1)}\, \|x_{\alpha_0} - x\|.$
If in the definition of $\hat\alpha$ we take $0 < \eta < \frac{q(\tau - 1)}{\|x_{\alpha_0} - x\|}$, then we always have $\alpha_\ast > \hat\alpha$. Therefore the RG-rule in Definition 3.1 simply reduces to the form: $\alpha_\ast$ is the largest parameter in $\Delta_q$ such that

    $\|s_{\alpha_\ast}(A)(A x_{\alpha_\ast}^\delta - z^\delta)\| \le \tau\,\delta(\alpha_\ast).$

The oracle inequality in Theorem 3.1 still holds for this simplified parameter choice rule.

(b) The oracle inequality established in Theorem 3.1 can be used to derive error bounds when the solution $x$ has smoothness given in terms of general source conditions, i.e., if $x - \bar x$ belongs to some source set introduced in Definition 2.3. To see this, we assume that the regularization has qualification $\psi$ as in Definition 2.4 and $x - \bar x \in H_\psi$. We also assume, as introduced in Remark 3.1, that the noise can be bounded as $\|A^{-\mu}\zeta\| \le 1$, which results in $\delta(\alpha) = \delta\,\alpha^{\mu}$ for $0 \le \mu \le 1/2$. Then, for the parameter $\alpha_\ast$ determined by the RG-rule in Definition 3.1, it follows from Theorem 3.1 that

    $\|x_{\alpha_\ast}^\delta - x\| \le C \inf_{0 < \alpha \le \alpha_0} \left\{\psi(\alpha) + \frac{\delta}{\alpha^{1 - \mu}}\right\}.$

Associated with the smoothness $\psi$, let $\Theta_{\mu,\psi}(t) := t^{1 - \mu}\,\psi(t)$, $t > 0$, which is a strictly increasing function. Given $\delta > 0$ we assign $\alpha_\delta > 0$ such that $\Theta_{\mu,\psi}(\alpha_\delta) = \delta$. Then we can conclude that

    $\|x_{\alpha_\ast}^\delta - x\| \le C\left\{\psi(\alpha_\delta) + \frac{\delta}{\alpha_\delta^{1 - \mu}}\right\} \le 2C\,\psi\!\left(\Theta_{\mu,\psi}^{-1}(\delta)\right),$

which was shown to be order optimal for $x$ with the above smoothness in [18, Theorem 4]. Thus the present results cover part of the analysis carried out in [18]; they extend the stopping criteria studied there to the RG-rule, and hence this relates to [17]. However, the above approach is limited. First, the case of small noise, i.e., when $1/2 < \mu$, cannot be covered. Secondly, the oracle inequality is seen to hold only for those solutions $x$ satisfying Assumption 2.4.

4. Proof of the main result for random noise. The proof of Theorem 2.1 will be carried out in several steps, similar to those in the recent studies [2, 16]. Our starting point is the inequality (2.17). Recall that $Z_\kappa$ is the set defined by (2.16), i.e.
$Z_\kappa \subset X$ consists of those realizations of the noise $\zeta$ obeying Assumption 3.1 along the sequence $\alpha_0, \dots, \hat\alpha$ with
$$\delta(\alpha) := (1+\kappa)\,\frac{\delta}{\sqrt{\alpha}\,\varrho_N(\alpha)}, \quad \alpha > 0, \qquad (4.1)$$
where $\hat\alpha$ is the largest number in $\Delta_q$ satisfying
$$\Theta_{\varrho_N}(\hat\alpha) \le \eta(1+\kappa)\delta. \qquad (4.2)$$
According to the definition of $\alpha_{RG}$ we have $\alpha_{RG} \ge \hat\alpha$. In order to estimate the first term on the right of (2.17) with $\alpha := \alpha_{RG}$, we observe that when $\zeta \in Z_\kappa$, the parameter $\alpha_*$ defined by the deterministic RG rule in Definition 3.1 with $\delta(\alpha)$ given by (4.1) is equal to the parameter $\alpha_{RG}$ defined by the statistical RG rule in Definition 2.2. Therefore we may use Theorem 3.1 to conclude
$$\sup_{\zeta \in Z_\kappa} \|x - x_{\alpha_{RG}}^{\delta}\| \le C\,\inf_{0 < \alpha \le \alpha_0}\Big\{\|x_\alpha - x\| + \frac{(1+\kappa)\,\delta}{\Theta_{\varrho_N}(\alpha)}\Big\}. \qquad (4.3)$$
In the following we estimate the second term on the right side of (2.17) with $\alpha = \alpha_{RG}$. We need some auxiliary results.

Lemma 4.1. Let Assumption 2.2 hold. Let $\hat\alpha = q^{\hat n}\alpha_0 \in \Delta_q$ be the largest parameter satisfying (4.2). Then there is a constant $C$ such that $\hat n \le C\,(1 + \log(1/\delta))$.

Proof. Since $\|A\| > 0$ is the first eigenvalue of $A$, it follows from the definition of $N(\alpha)$ that
$$N(\alpha) = \mathrm{Tr}\big[(\alpha I + A)^{-1}A\big] \ge \frac{\|A\|}{\alpha + \|A\|} \ge \frac{\|A\|}{\alpha_0 + \|A\|}, \quad 0 < \alpha \le \alpha_0.$$
Therefore, with $C := (\alpha_0 + \|A\|)/\|A\|$, we obtain
$$\Theta_{\varrho_N}(\alpha) = \sqrt{\frac{\alpha}{N(\alpha)}} \le C^{1/2}\sqrt{\alpha}, \quad 0 < \alpha \le \alpha_0.$$
According to the definition of $\hat\alpha$ we have $\Theta_{\varrho_N}(\hat\alpha/q) > \eta(1+\kappa)\delta \ge \eta\delta$. Consequently $C^{1/2}(\hat\alpha/q)^{1/2} \ge \eta\delta$, which implies the result.

We shall also use some prerequisites about Gaussian random elements in Banach spaces, and we recall the following results from [14, Lemma 3.1 & Corollary 3.2].

Lemma 4.2. Let $\Xi$ be any Gaussian element in some Banach space. Then
$$\mathbb P\big[\|\Xi\| > \mathbb E[\|\Xi\|] + b\big] \le e^{-b^2/(2v^2)} \quad\text{with}\quad v^2 := \sup_{\|w\|\le 1}\mathbb E\big[\langle \Xi, w\rangle^2\big].$$
Moreover, for each $p > 1$ there is a constant $C_p$ such that $\big(\mathbb E[\|\Xi\|^p]\big)^{1/p} \le C_p\,\mathbb E[\|\Xi\|]$.

We apply Lemma 4.2 to $\Xi := s_\alpha^{1/2}(A)\zeta = s_\alpha^{1/2}(A)T^*\xi$. For each fixed $\alpha \in \Delta_q$ we denote
$$Z_{\kappa,\alpha} := \Big\{\zeta :\ \|s_\alpha^{1/2}(A)\zeta\| \le \frac{1+\kappa}{\varrho_N(\alpha)}\Big\}.$$

Corollary 4.1. For each $0 < \alpha \le \alpha_0$ there holds
$$\mathbb P_\zeta\big[Z_{\kappa,\alpha}^c\big] \le e^{-\kappa^2 N(\alpha)/2} \quad\text{and}\quad \big(\mathbb E\big[\|s_\alpha^{1/2}(A)\zeta\|^4\big]\big)^{1/4} \le \frac{C_4}{\varrho_N(\alpha)}.$$

Proof. We first estimate $\mathbb P_\zeta[Z_{\kappa,\alpha}^c]$. The expected norm of $\Xi$ was bounded in (2.11), as
$$\mathbb E\big[\|s_\alpha^{1/2}(A)\zeta\|\big] \le \big(\mathbb E\big[\|s_\alpha^{1/2}(A)\zeta\|^2\big]\big)^{1/2} = \frac{1}{\varrho_N(\alpha)}. \qquad (4.4)$$
For any $w \in X$ with $\|w\| \le 1$, the weak second moments can be bounded from above by
$$\mathbb E\big[\langle \Xi, w\rangle^2\big] = \mathbb E\big[\langle \xi, T s_\alpha^{1/2}(A)w\rangle^2\big] = \|T s_\alpha^{1/2}(A)w\|^2 \le \|T s_\alpha^{1/2}(A)\|^2.$$
Thus we may apply Lemma 4.2 with $b := \kappa/\varrho_N(\alpha)$ to conclude that
$$\mathbb P_\zeta\big[Z_{\kappa,\alpha}^c\big] \le e^{-\kappa^2/(2\varrho_N(\alpha)^2)} = e^{-\kappa^2 N(\alpha)/2},$$
which completes the proof of the first assertion. The second one is a consequence of (4.4) and Lemma 4.2.

Finally we turn to the proof of the main result.

Proof of Theorem 2.1. We will use (2.17) with $\alpha = \alpha_{RG}$. The first term on the right has been estimated in (4.3). By using Lemma 4.1 and Corollary 4.1 we obtain from $Z_\kappa^c = \bigcup_{\hat\alpha \le \alpha \le \alpha_0,\ \alpha \in \Delta_q} Z_{\kappa,\alpha}^c$ that
$$\mathbb P_\zeta[Z_\kappa^c] \le (\hat n + 1)\,\sup_{\hat\alpha \le \alpha \le \alpha_0,\ \alpha \in \Delta_q}\mathbb P_\zeta\big[Z_{\kappa,\alpha}^c\big] \le C\,(1 + \log(1/\delta))\,e^{-\kappa^2 N(\alpha_0)/2}.$$
For $\kappa = \sqrt{8\log(1/\delta)/N(\alpha_0)}$ we have $e^{-\kappa^2 N(\alpha_0)/2} = \delta^4$. Therefore, we obtain
$$\mathbb P_\zeta[Z_\kappa^c] \le C\,(1 + \log(1/\delta))\,\delta^4. \qquad (4.5)$$
It remains to establish a bound for $\mathbb E\big[\|x - x_{\alpha_{RG}}^\delta\|^4\big]$. We emphasize that the random element $x_{\alpha_{RG}}^\delta$ is in general no longer Gaussian, since the parameter $\alpha_{RG}$ depends on the data $\zeta$. Hence we cannot apply Lemma 4.2 directly. Therefore we use the error bound (3.6), which is valid for every $\zeta$. By using the facts that $\alpha_{RG} \ge \hat\alpha$ and that for each $t$ the function $\alpha \mapsto s_\alpha^{1/2}(t)/\sqrt{\alpha}$ is decreasing, we obtain
$$\|x - x_{\alpha_{RG}}^\delta\| \le \|x - x_{\alpha_{RG}}\| + c_*\,\frac{\delta\,\|s_{\alpha_{RG}}^{1/2}(A)\zeta\|}{\sqrt{\alpha_{RG}}} \le \|x - x_{\alpha_0}\| + c_*\,\frac{\delta\,\|s_{\hat\alpha}^{1/2}(A)\zeta\|}{\sqrt{\hat\alpha}}.$$
Since $\hat\alpha$ is deterministic, the element $s_{\hat\alpha}^{1/2}(A)\zeta$ is Gaussian. Thus we may use the bound on the fourth moment of $s_{\hat\alpha}^{1/2}(A)\zeta$ given in Corollary 4.1 to obtain
$$\big(\mathbb E\big[\|x - x_{\alpha_{RG}}^\delta\|^4\big]\big)^{1/4} \le \|x - x_{\alpha_0}\| + c_*\,C_4\,\frac{\delta}{\Theta_{\varrho_N}(\hat\alpha)}.$$
Since the function $\varrho_N(\alpha)$ is monotone, it is easy to obtain that $\Theta_{\varrho_N}(\hat\alpha) \ge q\,\Theta_{\varrho_N}(\hat\alpha/q)$. By the definition of $\hat\alpha$ we then obtain $\Theta_{\varrho_N}(\hat\alpha) \ge q\eta(1+\kappa)\delta \ge q\eta\delta$. Consequently
$$\big(\mathbb E\big[\|x - x_{\alpha_{RG}}^\delta\|^4\big]\big)^{1/4} \le \|x - x_{\alpha_0}\| + \frac{c_*\,C_4}{q\eta}. \qquad (4.6)$$
By inserting the bounds (4.3), (4.5), and (4.6) into (2.17), and using the fact that $\Theta_{\varrho_N}(\alpha_0)/\Theta_{\varrho_N}(\alpha) \ge 1$ for $\alpha \le \alpha_0$, we can conclude that
$$\big(\mathbb E\big[\|x - x_{\alpha_{RG}}^\delta\|^2\big]\big)^{1/2} \le C\,\inf_{0 < \alpha \le \alpha_0}\Big\{\|x_\alpha - x\| + \frac{(1+\kappa)\,\delta}{\Theta_{\varrho_N}(\alpha)}\Big\}.$$
The proof is therefore complete.

5. Numerical simulations.
In this section we present some numerical simulations for our statistical RG rule, comparing it with the varying discrepancy principle of [16] and with the well-known parameter choice rules of generalized cross validation (GCV) and unbiased risk estimation (URE).
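Before turning to the experiments, the quantity driving the statistical rules, the effective dimension $N(\alpha) = \mathrm{Tr}[(\alpha I + A)^{-1}A]$, can be made concrete. The following Python sketch is purely illustrative (the power-law eigenvalues are an assumption, not taken from the paper); it checks the lower bound $N(\alpha) \ge \|A\|/(\alpha + \|A\|)$ used in the proof of Lemma 4.1.

```python
# Illustrative sketch: effective dimension of a diagonal trace-class
# operator A with hypothetical eigenvalues lam_j = j**(-2).

def effective_dimension(alpha, lams):
    """N(alpha) = Tr[(alpha*I + A)^{-1} A] for a diagonal operator A."""
    return sum(l / (alpha + l) for l in lams)

lams = [j ** -2.0 for j in range(1, 2001)]  # eigenvalues of A
norm_A = max(lams)                          # first (largest) eigenvalue

for alpha in (1e-1, 1e-3, 1e-6):
    n_alpha = effective_dimension(alpha, lams)
    # lower bound from the proof of Lemma 4.1: the largest eigenvalue
    # alone already contributes norm_A / (alpha + norm_A)
    assert n_alpha >= norm_A / (alpha + norm_A)
    print(alpha, n_alpha)
```

As $\alpha$ decreases, $N(\alpha)$ increases monotonically; only the lower bound coming from the largest eigenvalue enters Lemma 4.1.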
5.1. Comparison with the varying discrepancy principle. We first consider the linear integral equation
$$T x(s) := \int_0^1 k(s,t)\,x(t)\,dt = y(s) \quad\text{on } [0,1], \qquad (5.1)$$
where
$$k(s,t) = \begin{cases} s(1-t), & s \le t,\\ t(1-s), & t \le s.\end{cases} \qquad (5.2)$$
It is easy to see that $T$ is a compact operator from $L^2[0,1]$ to $L^2[0,1]$. We assume that the solution to be reconstructed is
$$x(t) = 2t(1-t)(3 - 8t - 4t^2).$$
Let $y := T x$ be the exact data. By adding a multiple of white Gaussian noise $\xi$ to $y$ we obtain noisy data $y^\delta = T x + \delta\xi$ with noise level $\delta > 0$. We then use $y^\delta$ to recover $x$ by various linear regularization methods when the regularization parameter is chosen by our statistical RG rule. All the following numerical simulations are performed in Matlab on an equidistant grid with $N = 5$ points in $[0,1]$.

As was mentioned in the introduction, our statistical RG rule was proposed with the purpose of removing the early saturation which is an inherent property of the varying discrepancy principle in [16]. It is therefore necessary to compare these two parameter choice rules numerically. The varying discrepancy principle can be formulated, with slight modification, as follows.

Definition 5.1 (varying discrepancy principle). Given $\tau > 1$, $\eta > 0$ and $\kappa \ge 0$, let $\alpha_{DP}$ be the largest parameter $\alpha \in \Delta_q$ for which either
$$\|s_\alpha^{1/2}(A)(Ax_\alpha^\delta - z^\delta)\| \le \frac{\tau(1+\kappa)\,\delta}{\varrho_N(\alpha)} \quad\text{or}\quad \Theta_{\varrho_N}(\alpha) \le \eta(1+\kappa)\delta.$$

It is clear that the varying discrepancy principle follows from the statistical RG rule with $s_\alpha(A)$ in the regular stop (2.13) replaced by $s_\alpha^{1/2}(A)$. According to the convergence analysis of these two parameter choice rules, we need to take $\kappa = \mu\sqrt{\log(1/\delta)}$ with a constant $\mu > 0$. In order to avoid terminating at a too large regularization parameter, we need to take $\eta$ and $\mu$ small and $\tau$ close to 1. In our numerical simulations we always take $\tau = 1.1$, $\eta = 0.1$ and $\mu = 0.2$; we also take $\Delta_q = \{\alpha_0 q^k : k = 0, 1, \dots\}$ with $q = 1/1.2$. Table 1.
Numerical comparison between the statistical RG rule and the varying discrepancy principle: root mean square error for a range of noise levels $\delta$.

Our numerical comparison of these two rules is based on evaluating the corresponding root mean squares, which are obtained by averaging over 4 simulations.
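The experiment just described can be sketched in a few lines. The following Python reimplementation is schematic and hypothetical: it uses the Green's-function kernel (5.2) with midpoint quadrature, a plain discrepancy-type stop in place of the exact RG functional of Definition 2.2, and a hypothetical smooth solution; it is not the authors' Matlab code. What it illustrates is the geometric search over $\Delta_q$: start from a large parameter and multiply by $q$ until the stopping criterion first holds.

```python
import numpy as np

def green_kernel(s, t):
    # k(s,t) = s(1-t) for s <= t and t(1-s) for t <= s, cf. (5.2)
    return np.where(s <= t, s * (1.0 - t), t * (1.0 - s))

def setup(n=200):
    t = (np.arange(n) + 0.5) / n                    # midpoint grid on [0, 1]
    K = green_kernel(t[:, None], t[None, :]) / n    # quadrature weight 1/n
    x_true = np.sin(np.pi * t)                      # hypothetical solution
    return K, x_true

def tikhonov(K, y, alpha):
    n = K.shape[1]
    return np.linalg.solve(alpha * np.eye(n) + K.T @ K, K.T @ y)

def geometric_choice(K, y, delta, tau=1.1, q=1.0 / 1.2, alpha0=1.0):
    # largest alpha in {alpha0 * q**k} whose residual meets the
    # (simplified) discrepancy-type criterion
    alpha = alpha0
    for _ in range(300):
        x = tikhonov(K, y, alpha)
        if np.linalg.norm(K @ x - y) <= tau * delta:
            break
        alpha *= q
    return alpha, x

rng = np.random.default_rng(0)
K, x_true = setup()
y_exact = K @ x_true
xi = rng.standard_normal(K.shape[0])
noise = 1e-4 * xi / np.linalg.norm(xi)              # noise of exact norm delta
alpha, x_rec = geometric_choice(K, y_exact + noise, delta=1e-4)
print(alpha, np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))
```

With $q = 1/1.2$ the search visits roughly $\log(\alpha_0/\alpha)/\log(1/q)$ parameters, one linear solve each, rather than evaluating a functional on an entire grid as GCV and URE require.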
In Table 1 we report the root mean squares for these two rules when the estimators are defined by Tikhonov regularization with initial guess $x_0 = 0$. The table clearly indicates the better performance of our statistical RG rule.

In order to compare our statistical RG rule with generalized cross validation (GCV) and unbiased risk estimation (URE), we consider the semi-discrete version of (5.1),
$$(T_N x)(t_i) := \int_0^1 k(t_i, t)\,x(t)\,dt = y(t_i) + \delta\xi_i, \quad i = 1,\dots,N,$$
where $T_N : L^2[0,1] \to \mathbb R^N$ with $N = 5$ and $t_i = (i-1)/(N-1)$, and $\{\xi_i\}$ is white Gaussian noise satisfying
$$\mathbb E[\xi_i] = 0 \quad\text{and}\quad \mathbb E[\xi_i\xi_j] = \begin{cases} 1, & i = j,\\ 0, & i \ne j.\end{cases}$$
The estimator from Tikhonov regularization is given by
$$x_\alpha^\delta = (\alpha I + T_N^* T_N)^{-1} T_N^* y^\delta,$$
where $y^\delta = (y(t_i) + \delta\xi_i)$. In order to choose the regularization parameter, several rules have been proposed for statistical inverse problems. The GCV criterion selects the regularization parameter as the minimizer of the functional
$$\mathrm{GCV}(\alpha) := \frac{\frac1N\|T_N x_\alpha^\delta - y^\delta\|^2}{\big[\frac1N\,\mathrm{Tr}(I - A_\alpha)\big]^2},$$
and it does not require any information on $\delta$, while URE selects the regularization parameter that minimizes the functional
$$U(\alpha) = \frac1N\|T_N x_\alpha^\delta - y^\delta\|^2 + \frac{2\delta^2}{N}\,\mathrm{Tr}(A_\alpha) - \delta^2,$$
and it requires the knowledge of $\delta$; here $A_\alpha = T_N(\alpha I + T_N^* T_N)^{-1} T_N^*$. One may refer to [21] for detailed information on these two rules. For each of these two rules, one has to solve a minimization problem to obtain the parameter. Since the cost functionals involved are not necessarily convex, finding their minimizers is nontrivial. The usual procedure is to fix a range $[\alpha_m, \alpha_M]$, select a grid $\{\alpha_j\}$ of this interval, evaluate the functionals at these grid points, and pick the one that achieves the minimum. This procedure is clearly time-consuming. Our statistical RG rule starts from a large number and then decreases it by a suitable ratio until the selection criterion is satisfied for the first time, and hence it takes less time.
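The grid-search procedure just described can be sketched as follows. This is an illustrative Python mock-up: a hypothetical random symmetric matrix stands in for $T_N$, and the logarithmic grid $[\alpha_m, \alpha_M]$ and all numerical values are assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
U, _ = np.linalg.qr(rng.standard_normal((N, N)))
s = 1.0 / (1.0 + np.arange(N)) ** 2        # decaying singular values
T = U @ np.diag(s) @ U.T                   # hypothetical forward matrix T_N

x_true = np.sin(np.linspace(0.0, np.pi, N))
delta = 1e-3
y = T @ x_true + delta * rng.standard_normal(N)

def influence(alpha):
    # A_alpha = T_N (alpha I + T_N^T T_N)^{-1} T_N^T, so A_alpha @ y = T_N x_alpha
    return T @ np.linalg.solve(alpha * np.eye(N) + T.T @ T, T.T)

def gcv(alpha):
    A = influence(alpha)
    r = y - A @ y                          # residual, same norm as T_N x_alpha - y
    return (r @ r / N) / (np.trace(np.eye(N) - A) / N) ** 2

def ure(alpha):
    A = influence(alpha)
    r = y - A @ y
    return r @ r / N + 2.0 * delta ** 2 * np.trace(A) / N - delta ** 2

grid = [10.0 ** e for e in np.linspace(-8, 0, 50)]   # grid on [alpha_m, alpha_M]
alpha_gcv = min(grid, key=gcv)
alpha_ure = min(grid, key=ure)
print(alpha_gcv, alpha_ure)
```

Each functional evaluation costs a full linear solve, so scanning the whole grid is the time-consuming step the text refers to; the statistical RG rule instead stops at the first admissible parameter.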
For the numerical performance, we include the simulation results in Figure 5.1, in which we take $\delta = 0.1$. The results clearly indicate that our statistical RG rule performs better.

REFERENCES

[1] G. Blanchard and P. Mathé, Conjugate gradient regularization under general smoothness and noise assumptions, J. Inverse Ill-Posed Probl., 18(6), 2010.
[2] G. Blanchard and P. Mathé, Discrepancy principle for statistical inverse problems with application to conjugate gradient regularization, Technical report, University of Potsdam, 2011.
Fig. 5.1. Comparison of the statistical RG rule with GCV and URE (panels: reconstructions by RG, URE, and GCV).

[3] L. Cavalier, Nonparametric statistical inverse problems, Inverse Problems, 24(3), 034004 (19pp), 2008.
[4] L. Cavalier, G. K. Golubev, D. Picard, and A. B. Tsybakov, Oracle inequalities for inverse problems, Ann. Statist., 30(3), 2002.
[5] H. W. Engl, M. Hanke and A. Neubauer, Regularization of Inverse Problems, Kluwer, Dordrecht, 1996.
[6] H. Gfrerer, An a posteriori parameter choice for ordinary and iterated Tikhonov regularization of ill-posed problems leading to optimal convergence rates, Math. Comp., 49(180), S5–S12, 1987.
[7] B. Hofmann and P. Mathé, Analysis of profile functions for general linear regularization methods, SIAM J. Numer. Anal., 45(3), (electronic), 2007.
[8] I. A. Ibragimov and Y. A. Rozanov, Gaussian random processes, volume 9 of Applications of Mathematics, Springer-Verlag, New York, 1978. Translated from the Russian by A. B. Aries.
[9] Q. Jin, A unified treatment of regularization methods for linear ill-posed problems, Numer. Math. J. Chinese Univ. (English Ser.), 9(1), 2000.
[10] Q. Jin, On the iteratively regularized Gauss-Newton method for solving nonlinear ill-posed problems, Math. Comp., 69(232), 2000.
[11] Q. Jin, On a class of frozen regularized Gauss-Newton methods for nonlinear inverse problems, Math. Comp., 79(272), 2010.
[12] Q. Jin and Z. Y. Hou, On an a posteriori parameter choice strategy for Tikhonov regularization of nonlinear ill-posed problems, Numer. Math., 83(1), 139–159, 1999.
[13] S. Kindermann and A. Neubauer, On the convergence of the quasioptimality criterion for (iterated) Tikhonov regularization, Inverse Probl. Imaging, 2(2), 2008.
[14] M. Ledoux and M. Talagrand, Probability in Banach spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3), Springer-Verlag, Berlin, 1991. Isoperimetry and processes.
[15] O. V.
Lepskiĭ, A problem of adaptive estimation in Gaussian white noise, Teor. Veroyatnost. i Primenen., 35(3), 1990.
[16] S. Lu and P. Mathé, Varying discrepancy principle as an adaptive parameter selection in statistical inverse problems, submitted, 2012.
[17] P. Mathé and U. Tautenhahn, Enhancing linear regularization to treat large noise, J. Inverse Ill-Posed Probl., 19(6), 2011.
[18] P. Mathé and U. Tautenhahn, Regularization under general noise assumptions, Inverse Problems, 27(3), 035016, 2011.
[19] A. Neubauer, The convergence of a new heuristic parameter selection criterion for general regularization methods, Inverse Problems, 24(5), 055005 (10pp), 2008.
[20] T. Raus, The principle of the residual in the solution of ill-posed problems, Tartu Riikl. Ül. Toimetised, (672), 16–26.
[21] C. R. Vogel, Computational Methods for Inverse Problems, Frontiers in Applied Mathematics 23, SIAM, Philadelphia, PA, 2002.
Korean J. Math. 25 (2017), No. 4, pp. 469 481 https://doi.org/10.11568/kjm.2017.25.4.469 GENERAL NONCONVEX SPLIT VARIATIONAL INEQUALITY PROBLEMS Jong Kyu Kim, Salahuddin, and Won Hee Lim Abstract. In this
More informationLinear Inverse Problems
Linear Inverse Problems Ajinkya Kadu Utrecht University, The Netherlands February 26, 2018 Outline Introduction Least-squares Reconstruction Methods Examples Summary Introduction 2 What are inverse problems?
More informationAN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE. Mona Nabiei (Received 23 June, 2015)
NEW ZEALAND JOURNAL OF MATHEMATICS Volume 46 (2016), 53-64 AN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE Mona Nabiei (Received 23 June, 2015) Abstract. This study first defines a new metric with
More informationSmall ball probabilities and metric entropy
Small ball probabilities and metric entropy Frank Aurzada, TU Berlin Sydney, February 2012 MCQMC Outline 1 Small ball probabilities vs. metric entropy 2 Connection to other questions 3 Recent results for
More informationTikhonov Regularization of Large Symmetric Problems
NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi
More information(Linear Programming program, Linear, Theorem on Alternative, Linear Programming duality)
Lecture 2 Theory of Linear Programming (Linear Programming program, Linear, Theorem on Alternative, Linear Programming duality) 2.1 Linear Programming: basic notions A Linear Programming (LP) program is
More informationFUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES. Yılmaz Yılmaz
Topological Methods in Nonlinear Analysis Journal of the Juliusz Schauder Center Volume 33, 2009, 335 353 FUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES Yılmaz Yılmaz Abstract. Our main interest in this
More informationOn Friedrichs inequality, Helmholtz decomposition, vector potentials, and the div-curl lemma. Ben Schweizer 1
On Friedrichs inequality, Helmholtz decomposition, vector potentials, and the div-curl lemma Ben Schweizer 1 January 16, 2017 Abstract: We study connections between four different types of results that
More informationSemi-strongly asymptotically non-expansive mappings and their applications on xed point theory
Hacettepe Journal of Mathematics and Statistics Volume 46 (4) (2017), 613 620 Semi-strongly asymptotically non-expansive mappings and their applications on xed point theory Chris Lennard and Veysel Nezir
More informationConvergence Rates in Regularization for Nonlinear Ill-Posed Equations Involving m-accretive Mappings in Banach Spaces
Applied Mathematical Sciences, Vol. 6, 212, no. 63, 319-3117 Convergence Rates in Regularization for Nonlinear Ill-Posed Equations Involving m-accretive Mappings in Banach Spaces Nguyen Buong Vietnamese
More informationNonlinear error dynamics for cycled data assimilation methods
Nonlinear error dynamics for cycled data assimilation methods A J F Moodey 1, A S Lawless 1,2, P J van Leeuwen 2, R W E Potthast 1,3 1 Department of Mathematics and Statistics, University of Reading, UK.
More informationEXISTENCE RESULTS FOR QUASILINEAR HEMIVARIATIONAL INEQUALITIES AT RESONANCE. Leszek Gasiński
DISCRETE AND CONTINUOUS Website: www.aimsciences.org DYNAMICAL SYSTEMS SUPPLEMENT 2007 pp. 409 418 EXISTENCE RESULTS FOR QUASILINEAR HEMIVARIATIONAL INEQUALITIES AT RESONANCE Leszek Gasiński Jagiellonian
More informationSelf-Concordant Barrier Functions for Convex Optimization
Appendix F Self-Concordant Barrier Functions for Convex Optimization F.1 Introduction In this Appendix we present a framework for developing polynomial-time algorithms for the solution of convex optimization
More informationStrictly Stationary Solutions of Autoregressive Moving Average Equations
Strictly Stationary Solutions of Autoregressive Moving Average Equations Peter J. Brockwell Alexander Lindner Abstract Necessary and sufficient conditions for the existence of a strictly stationary solution
More informationMorozov s discrepancy principle for Tikhonov-type functionals with non-linear operators
Morozov s discrepancy principle for Tikhonov-type functionals with non-linear operators Stephan W Anzengruber 1 and Ronny Ramlau 1,2 1 Johann Radon Institute for Computational and Applied Mathematics,
More informationPutzer s Algorithm. Norman Lebovitz. September 8, 2016
Putzer s Algorithm Norman Lebovitz September 8, 2016 1 Putzer s algorithm The differential equation dx = Ax, (1) dt where A is an n n matrix of constants, possesses the fundamental matrix solution exp(at),
More informationEmpirical Processes: General Weak Convergence Theory
Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated
More informationREGULARIZATION PARAMETER SELECTION IN DISCRETE ILL POSED PROBLEMS THE USE OF THE U CURVE
Int. J. Appl. Math. Comput. Sci., 007, Vol. 17, No., 157 164 DOI: 10.478/v10006-007-0014-3 REGULARIZATION PARAMETER SELECTION IN DISCRETE ILL POSED PROBLEMS THE USE OF THE U CURVE DOROTA KRAWCZYK-STAŃDO,
More informationChoosing the Regularization Parameter
Choosing the Regularization Parameter At our disposal: several regularization methods, based on filtering of the SVD components. Often fairly straightforward to eyeball a good TSVD truncation parameter
More informationNonparametric regression with martingale increment errors
S. Gaïffas (LSTA - Paris 6) joint work with S. Delattre (LPMA - Paris 7) work in progress Motivations Some facts: Theoretical study of statistical algorithms requires stationary and ergodicity. Concentration
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationAuthor(s) Huang, Feimin; Matsumura, Akitaka; Citation Osaka Journal of Mathematics. 41(1)
Title On the stability of contact Navier-Stokes equations with discont free b Authors Huang, Feimin; Matsumura, Akitaka; Citation Osaka Journal of Mathematics. 4 Issue 4-3 Date Text Version publisher URL
More information1 Introduction and preliminaries
Proximal Methods for a Class of Relaxed Nonlinear Variational Inclusions Abdellatif Moudafi Université des Antilles et de la Guyane, Grimaag B.P. 7209, 97275 Schoelcher, Martinique abdellatif.moudafi@martinique.univ-ag.fr
More informationClass 2 & 3 Overfitting & Regularization
Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating
More information