generalized modes in bayesian inverse problems
Christian Clason · Tapio Helin · Remo Kretschmann · Petteri Piiroinen

June 1, 2018

Abstract. This work is concerned with non-parametric modes and MAP estimates for priors that do not admit continuous densities, for which previous approaches based on small-ball probabilities fail. We propose a novel definition of generalized modes based on the concept of qualifying sequences, which reduce to the classical mode in certain situations that include Gaussian priors but also exist for a more general class of priors. The latter includes priors that impose strict bounds on the admissible parameters and in particular uniform priors. For uniform priors defined by random series with uniformly distributed coefficients, we show that the generalized modes (but not the classical modes) as well as the corresponding generalized MAP estimates can be characterized as minimizers of a suitable functional. This is then used to show consistency of nonlinear Bayesian inverse problems with uniform priors and Gaussian noise.

1 introduction

The Bayesian approach to inverse problems is attractive as it provides direct means to quantify uncertainty in the solution. Developments in modern sampling methodologies and increases in computational resources have made the Bayesian approach feasible in various large-scale inverse problems, and the approach has gained wide attention in recent years; we refer to the early work [11], the books [15, 31], and the more recent works [9, 20, 30] as well as to the references therein. To motivate our contribution, we recall that the Bayesian approach produces statistical inference based on the likelihood and prior probability distributions of the system, and that all expert information independent of the observation enters the analysis through the prior model.
Given some prior distribution µ_0 of the probable values of x, the Bayesian inverse problem is then to quantify the uncertainty in x given some noisy observation y = y^δ. It is well known, e.g., from [30] that under quite general conditions, the posterior µ_post is absolutely continuous with respect to the prior and is given by the Bayes formula

(1.1)    dµ_post/dµ_0 (x | y) ∝ exp(−Φ(x; y)),

Affiliations: Faculty of Mathematics, University Duisburg-Essen, Thea-Leymann-Strasse 9, Essen, Germany (christian.clason@uni-due.de, remo.kretschmann@uni-due.de); Department of Mathematics and Statistics, University of Helsinki, Gustaf Hällströmin katu 2b, FI Helsinki, Finland (tapio.helin@helsinki.fi, petteri.piiroinen@helsinki.fi)
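To make (1.1) concrete, the following sketch evaluates a posterior density on a grid in one dimension. The cubic forward map, noise level, and standard Gaussian prior are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical 1-D setup: forward map G(x) = x**3, Gaussian noise with
# standard deviation sigma, and a standard Gaussian prior.
sigma, y = 0.1, 0.5

def Phi(x):                                # negative log-likelihood Phi(x; y)
    return (y - x**3) ** 2 / (2 * sigma**2)

def prior_density(x):                      # standard Gaussian prior density
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Bayes formula (1.1): posterior density proportional to exp(-Phi) * prior.
xs = np.linspace(-2.0, 2.0, 4001)
dx = xs[1] - xs[0]
unnorm = np.exp(-Phi(xs)) * prior_density(xs)
Z = unnorm.sum() * dx                      # normalization constant Z(y)
posterior = unnorm / Z
print(abs(posterior.sum() * dx - 1.0) < 1e-12)
```

The normalization constant Z(y) reappears explicitly in (5.1) below; in the infinite-dimensional setting no Lebesgue reference density exists, which is precisely the difficulty the paper addresses.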
where Φ is the negative log-likelihood. Due to the frequent high dimensionality of x in inverse problems, there is a fundamental need to understand whether the problem scales well to the infinite-dimensional case and whether the related estimators have well-behaved limits; this leads to the study of non-parametric Bayesian inverse problems. Due to the computational effort required by Markov chain Monte Carlo (MCMC) algorithms in large-scale problems, there has been interest in studying how approximate Bayesian methods (see, e.g., [7, 24–26, 29]) can be used effectively in inverse problems. A central estimator used in this research effort is the maximum a posteriori (MAP) estimate, which can be formulated in a finite-dimensional setting as an optimization problem

(1.2)    x_MAP = arg min_x {Φ(x; y) + R(x)},

where R is the negative log-prior density. As an optimization problem, (1.2) is relatively fast to solve and is widely used in applications as the basis of, e.g., the Laplace approximation [32]. The study of non-parametric MAP estimates, or more generally, non-parametric modes, was only recently initiated by Dashti et al. in [8] in the context of nonlinear Bayesian inverse problems with Gaussian prior and noise distribution. In the Gaussian case, one has an explicit relation between the right-hand side functional in (1.2) and the Onsager–Machlup functional defined via the limit of small-ball probabilities, which can be used to give a statistical justification for the objective functional that is consistent with the finite-dimensional definition [8]. The difficulty related to non-parametric MAP theory is that the posterior does not have a natural density representation as in the finite-dimensional case, and consequently, the MAP estimate does not have a clear correspondence to an optimization problem such as (1.2). More recently, there has been a series of results [1, 10, 12] that extend the definition and scope of non-parametric MAP estimation.
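In finite dimensions, (1.2) is an ordinary optimization problem. The sketch below is a minimal illustration with an assumed identity forward map and Gaussian noise and prior, so the MAP estimate has the closed form y/(1 + σ²) against which a grid search can be checked.

```python
import numpy as np

# Illustrative linear-Gaussian instance of (1.2): Phi is the Gaussian negative
# log-likelihood for the identity forward map, R the Gaussian negative log-prior.
sigma, y = 0.1, 0.5

def Phi(x):
    return (x - y) ** 2 / (2 * sigma ** 2)

def R(x):
    return x ** 2 / 2

xs = np.linspace(-2.0, 2.0, 400001)
x_map = xs[np.argmin(Phi(xs) + R(xs))]     # brute-force minimizer of (1.2)
print(x_map)
```

For this example the closed-form minimizer is y / (1 + sigma**2) ≈ 0.495, which the grid search reproduces to grid accuracy.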
For example, weak MAP estimates were proposed and studied in [12] in the context of linear Bayesian inverse problems with a general class of priors. The authors used tools from the differentiation and quasi-invariance theory of measures (developed by Fomin and Skorokhod and discussed in detail in [4]) to connect the zero points of the logarithmic derivative of a measure to the minimizers of the Onsager–Machlup functional. This program of generalization is motivated by the fact that in practice, due to the ill-posedness and often high dimensionality of the problem, the prior plays a key role in successful stabilization [30] and posterior modeling for the inverse problem. Therefore, the prior needs to be carefully designed to reflect the best possible subjective information available. Here, we only mention priors that reflect the sparsity [19, 21], hierarchical structure [5, 33], or anisotropic features [16] of the unknown. However, in the non-parametric case, all results regarding the MAP estimate known to the authors require continuity of the prior; but even a simple one-dimensional discontinuous density can fail to have a mode in terms of the definition given in [8] (see Example 2.2). This is limiting, since suitable prior modeling may involve imposing strict bounds on the admissible values of x emerging from some fundamental properties of the application. As a simple example, consider the classical inverse problem of X-ray medical imaging [27]. Since it is reasonable to assume that there are no radiation sources inside the patient, the attenuation of X-rays is positive throughout the body. A related situation occurs in electrical impedance tomography and, more generally, in parameter estimation problems for partial differential
equations where pointwise upper and lower bounds need to be imposed on the parameter to ensure well-posedness of the forward problem; reasonable bounds are often available from a priori information on, e.g., the kinds of tissue or material expected in the region of interest. Furthermore, it is known in the deterministic theory that restriction of the unknown to a compact set can stabilize an inverse problem without additional (possibly undesired) regularization terms; this is sometimes referred to as quasi-solution or Ivanov regularization [13, 14, 23, 28], and the use of pointwise bounds for this purpose has recently been studied in [6, 17]. In the Bayesian approach, this would correspond to priors with densities whose mass is contained in a compact set. Together, this motivates the study of imposing such hard bounds as part of the prior in Bayesian inverse problems. The main contribution of our work in this direction is two-fold: First, we introduce a novel definition of generalized modes for probability measures and characterize conditions under which our definition coincides with the previous definition of modes given in [8]. More precisely, we show that if the Radon–Nikodym derivative dµ(· − h)/dµ of the translated measure µ has certain equicontinuity properties (described in Section 3.1) over a dense set of translations h, any generalized mode is also a mode in the sense of [8]. In particular, we demonstrate that our definition of generalized modes does not introduce pathological modes in the case of continuous densities and that strong and generalized modes coincide for Gaussian measures. Second, we consider uniform priors defined via the random series

(1.3)    ξ = Σ_{k=1}^∞ γ_k ξ_k ϕ_k,

where ξ_k ∼ U[−1, 1] are uniformly distributed and γ_k are suitable weights; such priors clearly have a discontinuous density.
We further show that the uniform prior (1.3) does not have any mode that touches the bounds (i.e., where ξ_k ∈ {−1, 1} for some k ∈ N) and, therefore, it is not guaranteed that the posterior has any mode. However, we prove that such points are in fact generalized modes of the prior. Furthermore, we show that for the uniform prior defined via (1.3), the definition of the generalized mode characterizes in a natural manner the MAP estimates of (1.1) as minimizers of (1.2). We also provide a weak consistency result regarding the MAP estimates in line with the previous work [1, 8]. This paper is organized as follows. In Section 2, we give our definition of generalized modes and illustrate the definition for the examples of a measure with discontinuous density (which does not admit a strong mode) and of Gaussian measures (where generalized and strong modes coincide). A general investigation of conditions for the generalized and strong modes to coincide is carried out in Section 3. We next construct uniform priors on an infinite-dimensional Banach space and characterize their generalized modes in Section 4. In Section 5, we study Bayesian inverse problems with such priors and derive a variational characterization of generalized modes that plays the role of a generalized Onsager–Machlup functional. This is used in Section 6 to show consistency of nonlinear Bayesian inverse problems with uniform priors and Gaussian noise.
2 generalized modes

Let X be a separable Banach space and µ be a probability measure on X. Throughout, let B_δ(x) ⊂ X denote the open ball around x ∈ X with radius δ, and set

M_δ := sup_{x∈X} µ(B_δ(x))    for δ > 0.

We first recall the definition of a mode introduced in [8].

Definition 2.1. A point x̂ ∈ X is called a (strong) mode of µ if

lim_{δ→0} µ(B_δ(x̂)) / M_δ = 1.

The modes of the posterior measure µ^y are called maximum a posteriori (MAP) estimates.

It is straightforward to construct a probability distribution with discontinuous Lebesgue density which does not have a mode according to Definition 2.1.

Example 2.2. Let µ be a probability measure on R with density p with respect to the Lebesgue measure, defined via

p̃(x) := 1 − x if x ∈ [0, 1] and p̃(x) := 0 otherwise,    p(x) := p̃(x) / ∫_R p̃(x) dx.

Clearly, x̂ = 0 maximizes p; however, this is not a strong mode. First, note that for every δ > 0,

µ(B_δ(δ)) = sup_{x∈R} µ(B_δ(x)) = M_δ.

Hence of all balls with radius δ, the one around x_δ := δ has the highest probability. However, although x_δ → 0 for δ → 0, we have for δ small enough that

µ(B_δ(0)) / µ(B_δ(δ)) = δ(1 − ½δ) / (2δ(1 − δ)) → ½ < 1 as δ → 0,

and hence Definition 2.1 is not satisfied.

The example above illustrates that as density functions are considered in more generality (e.g., as elements in L¹(R)), the concept of maximum point should be chosen carefully. This line of thought motivates the following definition.

Definition 2.3. A point x̂ ∈ X is called a generalized mode of µ if for every sequence {δ_n}_{n∈N} ⊂ (0, ∞) with δ_n → 0 there exists a qualifying sequence {w_n}_{n∈N} ⊂ X with w_n → x̂ in X and

(2.1)    lim_{n→∞} µ(B_{δ_n}(w_n)) / M_{δ_n} = 1.

We call generalized modes of the posterior measure µ^y generalized MAP estimates.
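A quick numerical check of Example 2.2 (pure Python, using the unnormalized density, since the normalization cancels in all ratios): balls centered at the maximizer 0 capture only about half of the maximal ball mass, while the shifted centers w_δ = δ, which form a qualifying sequence in the sense of Definition 2.3, capture all of it.

```python
# mass(c, delta) integrates the unnormalized density (1 - x) on [0, 1]
# over the ball B_delta(c); M_delta is attained at the center c = delta.
def mass(center, delta):
    lo, hi = max(center - delta, 0.0), min(center + delta, 1.0)
    if hi <= lo:
        return 0.0
    return (hi - lo) - (hi ** 2 - lo ** 2) / 2.0

for delta in (1e-2, 1e-4, 1e-6):
    M = mass(delta, delta)                  # maximal ball mass M_delta
    print(mass(0.0, delta) / M, mass(delta, delta) / M)
```

The first printed ratio tends to 1/2 as computed in the example, while the second is identically 1, illustrating why x̂ = 0 is a generalized but not a strong mode.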
Note that by Definition 2.3, every strong mode x̂ ∈ X is also a generalized mode with the qualifying sequence w_n := x̂. Furthermore, in Example 2.2, x̂ = 0 is a generalized mode with the qualifying sequence w_n := δ_n for any positive sequence {δ_n}_{n∈N} with δ_n → 0. In fact, we can use the following stronger condition to show that a point is a generalized mode.

Lemma 2.4. Let x̂ ∈ X. If there is a family {w_δ}_{δ>0} ⊂ X such that w_δ → x̂ in X as δ → 0 and

lim_{δ→0} µ(B_δ(w_δ)) / M_δ = 1,

then x̂ is a generalized mode of µ.

In Example 2.2, this condition is obviously satisfied for x̂ = 0 and w_δ := δ.

Remark 2.5. Definition 2.3 can be further generalized by allowing for weakly converging qualifying sequences. Conversely, we can restrict Definition 2.3 by coupling the convergence of w_n to that of δ_n (cf. Lemma 2.4). As we will show below, our definition has the advantage of not introducing pathological modes in the case of continuous densities (such as Gaussian or Besov distributions), while for the important case of a uniform prior, it introduces modes that are natural in terms of the variational characterization (1.2).

We now illustrate some of the key ideas of the generalized mode by considering the case of a Gaussian measure µ on X. We assume that µ is centered and note that the results below trivially generalize to the non-centered case. We will require the following quantitative estimate for Gaussian ball probabilities.

Lemma 2.6 ([8, Lem. 3.6]). Let x ∈ X and δ > 0 be given. Then there exists a constant a_1 > 0 independent of x and δ such that

µ(B_δ(x)) ≤ µ(B_δ(0)) e^{(a_1/2)δ²} e^{−(a_1/2)(‖x‖_X − δ)²}.

Now we can show that for a centered Gaussian measure µ, Definitions 2.1 and 2.3 do indeed coincide.

Theorem 2.7. The origin 0 ∈ X is both a strong mode and a generalized mode of µ, whereas all x ∈ X \ {0} are neither a strong mode nor a generalized mode.

Proof.
First, by Anderson's inequality (see, e.g., [3, Thm ]) we have that µ(B_δ(x)) ≤ µ(B_δ(0)) for all x ∈ X and δ > 0, and hence that M_δ = µ(B_δ(0)) for all δ > 0. This immediately yields that x̂ = 0 is a strong mode and hence also a generalized mode for µ. Now assume that x ∈ X \ {0} is a generalized mode of µ. Let {δ_n}_{n∈N} ⊂ (0, ∞) with δ_n → 0 be arbitrary and let {w_n}_{n∈N} ⊂ X be the respective qualifying sequence. Then, Lemma 2.6 yields

µ(B_δ(w)) / µ(B_δ(0)) ≤ e^{−(a_1/2)‖w‖_X(‖w‖_X − 2δ)} ≤ e^{−(a_1/2)(¾‖x‖_X)(¼‖x‖_X)} = e^{−(3a_1/32)‖x‖²_X} =: A < 1
for all δ ∈ (0, δ_0] and w ∈ B_{δ_0}(x) with δ_0 := ¼‖x‖_X. We choose n_0 ∈ N large enough such that ‖w_n − x‖_X < δ_0 and δ_n < δ_0 for all n ≥ n_0. Now taking the limit in the above yields

lim_{n→∞} µ(B_{δ_n}(w_n)) / M_{δ_n} = lim_{n→∞} µ(B_{δ_n}(w_n)) / µ(B_{δ_n}(0)) ≤ A < 1,

which is a contradiction. Hence, x is not a generalized mode and thus cannot be a strong mode, either.

The explicit bound in Lemma 2.6 plays an important role in Theorem 2.7. Compare this to the alternative approach using the Onsager–Machlup functional I(x) = ½‖x‖²_E, defined as satisfying

lim_{δ→0} µ(B_δ(x_1)) / µ(B_δ(x_2)) = exp(½‖x_2‖²_E − ½‖x_1‖²_E),

which holds for all x_1, x_2 from the Cameron–Martin space E ⊂ X of µ by [22, Prop. 18.3]. Although this relation can be used to show that 0 is the only generalized MAP estimate in E, it does not yield any information regarding X \ E. In the next section, we will, for general measures µ under additional continuity assumptions, extend results from a dense space such as E to the whole space X.

3 relation between generalized and strong modes

In this section, we derive conditions under which our definition of generalized modes coincides with the standard notion of strong modes. We do this based on further characterizations of the convergence of the qualifying sequence in the definition of generalized modes.

3.1 characterization by rate of convergence

Consider a general probability measure µ on X. Let us first make the fundamental observation that for δ_n > 0 and x̂, w_n ∈ X, we have that

(3.1)    µ(B_{δ_n}(x̂)) / sup_{x∈X} µ(B_{δ_n}(x)) = [µ(B_{δ_n}(x̂)) / µ(B_{δ_n}(w_n))] · [µ(B_{δ_n}(w_n)) / sup_{x∈X} µ(B_{δ_n}(x))].

Clearly, if x̂ is a generalized mode, we have control over the right-most ratio in (3.1). On the other hand, convergence of the ratio on the left-hand side in (3.1) is related to the definition of a strong mode. This leads to the following equivalence.

Theorem 3.1. Let x̂ ∈ X be a generalized mode of µ.
Then x̂ is a strong mode if and only if for every sequence {δ_n}_{n∈N} ⊂ (0, ∞) with δ_n → 0, there exists a qualifying sequence {w_n}_{n∈N} ⊂ X with w_n → x̂ and

(3.2)    lim_{n→∞} µ(B_{δ_n}(x̂)) / µ(B_{δ_n}(w_n)) = 1.
Proof. Let {δ_n}_{n∈N} be a positive sequence with δ_n → 0 and let {w_n}_{n∈N} be the corresponding qualifying sequence that satisfies (3.2). Then

lim_{n→∞} µ(B_{δ_n}(x̂)) / sup_{x∈X} µ(B_{δ_n}(x)) = lim_{n→∞} [µ(B_{δ_n}(x̂)) / µ(B_{δ_n}(w_n))] · lim_{n→∞} [µ(B_{δ_n}(w_n)) / sup_{x∈X} µ(B_{δ_n}(x))] = 1,

and hence x̂ is a strong mode. Conversely, assume that x̂ is not a strong mode. Then there exists a positive sequence {δ_n}_{n∈N} such that δ_n → 0 and

lim_{n→∞} µ(B_{δ_n}(x̂)) / sup_{x∈X} µ(B_{δ_n}(x)) < 1.

This, in turn, implies that

lim_{n→∞} µ(B_{δ_n}(x̂)) / µ(B_{δ_n}(w_n)) = lim_{n→∞} [µ(B_{δ_n}(x̂)) / sup_{x∈X} µ(B_{δ_n}(x))] · lim_{n→∞} [sup_{x∈X} µ(B_{δ_n}(x)) / µ(B_{δ_n}(w_n))] < 1

for every qualifying sequence {w_n}_{n∈N}. Hence (3.2) does not hold.

We can use this theorem to restrict the balls over which the supremum in the definition of M_δ is taken.

Corollary 3.2. Let x̂ ∈ X be a generalized mode of µ. If there exists an r > 0 such that

lim_{δ→0} µ(B_δ(x̂)) / sup_{w∈B_r(x̂)} µ(B_δ(w)) = 1,

then x̂ is a strong mode.

Proof. Let {δ_n}_{n∈N} be a positive sequence such that δ_n → 0, and let {w_n}_{n∈N} ⊂ X be the corresponding qualifying sequence. Then

lim_{n→∞} µ(B_{δ_n}(x̂)) / µ(B_{δ_n}(w_n)) ≥ lim_{n→∞} µ(B_{δ_n}(x̂)) / sup_{w∈B_r(x̂)} µ(B_{δ_n}(w)) = lim_{δ→0} µ(B_δ(x̂)) / sup_{w∈B_r(x̂)} µ(B_δ(w)) = 1

since w_n → x̂. Hence, (3.2) holds, and x̂ is therefore a strong mode by Theorem 3.1.

We illustrate the possibility of satisfying (3.2) with the following example.

Example 3.3. For a probability measure µ on R with continuous density p with respect to the Lebesgue measure, (3.2) is satisfied in every point x̂ ∈ R with p(x̂) > 0. To see this, let ε > 0. Then there exists a δ_0 > 0 such that for all x ∈ B_{δ_0}(x̂), |p(x) − p(x̂)| ≤ ε. Therefore, for all δ ∈ (0, δ_0/2) and w ∈ B_{δ_0/2}(x̂),

|µ(B_δ(w)) / (2δ) − p(x̂)| ≤ (1/(2δ)) ∫_{w−δ}^{w+δ} |p(x) − p(x̂)| dx ≤ ε.
So, for every positive sequence δ_n → 0 and every real sequence w_n → x̂,

lim_{n→∞} µ(B_{δ_n}(w_n)) / (2δ_n) = p(x̂) > 0.

This holds true in particular for the constant sequence w_n = x̂, so that

lim_{n→∞} µ(B_{δ_n}(x̂)) / µ(B_{δ_n}(w_n)) = p(x̂) / p(x̂) = 1.

The property (3.2) can be seen as the requirement to have sufficiently fast convergence of the qualifying sequences. In other words, we require that the balls B_{δ_n}(x̂) and B_{δ_n}(w_n) have asymptotically the same measure. This idea can be further quantified by the following theorem.

Theorem 3.4. Let x̂ ∈ X be a generalized mode of a Borel probability measure µ. If

(i) for every positive sequence {δ_n}_{n∈N} with δ_n → 0, there exists a qualifying sequence {w_n}_{n∈N} such that

lim_{n→∞} ‖w_n − x̂‖_X / δ_n = 0,

(ii) the family of functions {f_n}_{n∈N} on [0, 1] defined by

f_n : [0, 1] → R,    f_n(r) := µ(B_{r(δ_n + ‖w_n − x̂‖_X)}(x̂)) / µ(B_{δ_n + ‖w_n − x̂‖_X}(x̂)),

is equicontinuous at r = 1,

then x̂ is a strong mode.

Proof. Notice first that by monotonicity of probability, the function

ϕ_{x,s}(r) := µ(B_{rs}(x)) / µ(B_s(x))

is a left-continuous increasing function with right limits on [0, 1] for every x ∈ X and every s > 0, and hence so is f_n = ϕ_{x̂, δ_n + ‖w_n − x̂‖_X}. Let now {δ_n}_{n∈N} be a positive sequence with δ_n → 0 and let {w_n}_{n∈N} be a qualifying sequence satisfying assumption (i). Setting

r_n := ‖w_n − x̂‖_X    for all n ∈ N,

by the triangle inequality we have

µ(B_{δ_n − r_n}(x̂)) ≤ µ(B_{δ_n}(w_n)) ≤ µ(B_{δ_n + r_n}(x̂))

for n large enough. Furthermore, assumption (ii) implies that for every ε > 0,

f_n(1) − f_n((δ_n − r_n)/(δ_n + r_n)) < ε
for every n large enough, since

(δ_n − r_n)/(δ_n + r_n) = (1 − r_n/δ_n)/(1 + r_n/δ_n) → 1    as n → ∞.

We thus obtain that

lim sup_{n→∞} [µ(B_{δ_n + r_n}(x̂)) − µ(B_{δ_n}(w_n))] / µ(B_{δ_n + r_n}(x̂)) ≤ lim sup_{n→∞} [µ(B_{δ_n + r_n}(x̂)) − µ(B_{δ_n − r_n}(x̂))] / µ(B_{δ_n + r_n}(x̂)) ≤ ε,

which implies that

(3.3)    lim_{n→∞} [1 − µ(B_{δ_n}(w_n)) / µ(B_{δ_n + r_n}(x̂))] = 0.

Since for every ε > 0 the assumption (ii) yields f_n(1) − f_n(δ_n/(δ_n + r_n)) < ε for every n large, it also follows that

lim sup_{n→∞} [µ(B_{δ_n + r_n}(x̂)) − µ(B_{δ_n}(x̂))] / µ(B_{δ_n + r_n}(x̂)) ≤ ε

and in particular that

(3.4)    lim_{n→∞} µ(B_{δ_n}(x̂)) / µ(B_{δ_n + r_n}(x̂)) = 1.

The two limits (3.3) and (3.4) together yield that

lim_{n→∞} µ(B_{δ_n}(w_n)) / µ(B_{δ_n}(x̂)) = 1

and, therefore, x̂ is a strong mode by Theorem 3.1.

Remark 3.5. The equicontinuity condition (ii) is implied by the η-annular decay property (η-AD) at the generalized mode point (see [2] for the definition). In the finite-dimensional case, the 1-AD property at x is equivalent to f(r) = µ(B_r(x)) being locally absolutely continuous on (0, ∞) and f'(r)r ≤ c f(r) for some c > 0 and almost every r > 0. For example, it can be seen by direct computation that the function f in Example 2.2 has the 1-AD property at the generalized mode x̂ = 0.

3.2 characterization by convergence on a dense subspace

We next consider the case when the qualifying sequences {w_n}_{n∈N} are restricted to a dense subspace of X where a certain continuity of the ratios holds.

Proposition 3.6. Let x̂ ∈ X be a generalized mode of a Borel probability measure µ and let E be a dense subset of X. Then for every {δ_n}_{n∈N} ⊂ (0, ∞) with δ_n → 0 there exists a sequence {w̃_n}_{n∈N} ⊂ E with w̃_n → x̂ in X such that

lim_{n→∞} µ(B_{δ_n}(w̃_n)) / sup_{x∈X} µ(B_{δ_n}(x)) = 1.
Proof. We first show that for every δ > 0, the mapping x ↦ µ(B_δ(x)) is lower semi-continuous. To this end, let {x_n}_{n∈N} ⊂ X be a sequence converging to x ∈ X and let

χ_A(x̄) := 1 if x̄ ∈ A and χ_A(x̄) := 0 otherwise

denote the characteristic function of a set A ⊂ X. Then for every x̄ ∈ X, we have

χ_{B_δ(x)}(x̄) ≤ lim inf_{n→∞} χ_{B_δ(x_n)}(x̄),

and Fatou's Lemma yields

µ(B_δ(x)) = ∫_X χ_{B_δ(x)}(x̄) µ(dx̄) ≤ ∫_X lim inf_{n→∞} χ_{B_δ(x_n)}(x̄) µ(dx̄) ≤ lim inf_{n→∞} ∫_X χ_{B_δ(x_n)}(x̄) µ(dx̄) = lim inf_{n→∞} µ(B_δ(x_n)).

Now let {δ_n} ⊂ (0, ∞) with δ_n → 0 and let {w_n}_{n∈N} ⊂ X be a corresponding qualifying sequence. Consider a fixed n ∈ N. By the lower semi-continuity of x ↦ µ(B_{δ_n}(x)), we can choose an R > 0 such that

µ(B_{δ_n}(v)) ≥ µ(B_{δ_n}(w_n)) − (1/n) M_{δ_n}    for all v ∈ B_R(w_n).

Set r = min{R, 1/n}. Because E is dense in X, we can choose a w̃_n ∈ B_r(w_n) ∩ E, which therefore satisfies both

µ(B_{δ_n}(w̃_n)) / M_{δ_n} ≥ µ(B_{δ_n}(w_n)) / M_{δ_n} − 1/n

and

‖w̃_n − x̂‖_X ≤ ‖w̃_n − w_n‖_X + ‖w_n − x̂‖_X ≤ 1/n + ‖w_n − x̂‖_X.

As n ∈ N was arbitrary, we obtain the desired sequence {w̃_n}_{n∈N} ⊂ E with w̃_n → x̂ in X and

lim_{n→∞} µ(B_{δ_n}(w̃_n)) / M_{δ_n} ≥ lim_{n→∞} [µ(B_{δ_n}(w_n)) / M_{δ_n} − 1/n] = 1.

In order to give a sufficient condition for the coincidence of generalized and strong modes, we consider a more specific class of probability measures. Let µ be a Borel probability measure that possesses a dense continuously embedded subspace (E, ‖·‖_E) ⊂ X of admissible shifts, i.e., the shifted measure µ_h := µ(· − h) is equivalent to µ for every h ∈ E, and assume that for every h ∈ E the density of µ_h with respect to µ has a continuous representative dµ_h/dµ ∈ C(X).

Theorem 3.7. Let x̂ ∈ X be a generalized mode of µ. If

(i) for every {δ_n}_{n∈N} ⊂ (0, ∞) with δ_n → 0 there is a qualifying sequence {w_n}_{n∈N} ⊂ x̂ + E with ‖w_n − x̂‖_E → 0,
(ii) there is an R > 0 such that

f_R : (E, ‖·‖_E) → R,    f_R(w) := sup_{x∈B_R(x̂)} |dµ_{x̂−w}/dµ (x) − 1|

is continuous in x̂,

then x̂ is a strong mode.

Proof. Choosing N large enough such that δ_n ≤ R for all n ≥ N, we have for all n ≥ N that

|µ(B_{δ_n}(w_n)) / µ(B_{δ_n}(x̂)) − 1| = |(1/µ(B_{δ_n}(x̂))) ∫_{B_{δ_n}(x̂)} [dµ_{x̂−w_n}/dµ (x) − 1] µ(dx)| ≤ sup_{x∈B_{δ_n}(x̂)} |dµ_{x̂−w_n}/dµ (x) − 1| · (1/µ(B_{δ_n}(x̂))) ∫_{B_{δ_n}(x̂)} µ(dx) ≤ f_R(w_n).

However, f_R(w_n) → f_R(x̂) = 0 by the continuity of f_R and the convergence w_n → x̂ in E, so that

lim_{n→∞} µ(B_{δ_n}(w_n)) / µ(B_{δ_n}(x̂)) = 1.

Hence, x̂ is a strong mode by Theorem 3.1.

For a nondegenerate Gaussian measure µ = N(a, Q) on a separable Hilbert space X with mean a ∈ X and covariance operator Q ∈ L(X), the assumptions of Theorem 3.7 are fulfilled for E = Q(X) and x̂ = w_n := a for all n ∈ N as well as any R > 0.

Corollary 3.8. Let x̂ ∈ X be a generalized mode of µ that satisfies condition (i) of Theorem 3.7. If additionally

lim_{‖w−x̂‖_E→0} dµ_{x̂−w}/dµ (x̂) = 1

and there is an r > 0 such that the family

{dµ_{x̂−w}/dµ : w ∈ B^E_r(x̂)},    B^E_r(x̂) := {x ∈ x̂ + E : ‖x − x̂‖_E < r} ⊂ x̂ + E,

is equicontinuous in x̂, then x̂ is a strong mode.

Proof. We show that condition (ii) of Theorem 3.7 is satisfied as well, from which the claim then follows. For a given ε > 0 we choose 0 < r_0 ≤ r small enough such that for all w ∈ B^E_{r_0}(x̂) we have that

|dµ_{x̂−w}/dµ (x̂) − 1| ≤ ε/2.

If we also choose R > 0 small enough such that for all x ∈ B^X_R(x̂) and all w ∈ B^E_{r_0}(x̂) we have that

|dµ_{x̂−w}/dµ (x) − dµ_{x̂−w}/dµ (x̂)| ≤ ε/2,
then the triangle inequality yields that

|f_R(w) − f_R(x̂)| = sup_{x∈B_R(x̂)} |dµ_{x̂−w}/dµ (x) − 1| ≤ ε

for all w ∈ B^E_{r_0}(x̂) and therefore the continuity of f_R at x̂.

4 modes of uniform priors

We now demonstrate the usefulness of the concept of generalized modes for a class of uniform probability measures on infinite-dimensional spaces that can serve as priors in Bayesian inverse problems. We first discuss the rigorous construction of the uniform probability measure and then show that such measures admit generalized but in general not strong modes.

4.1 construction of the probability measure

We proceed similarly as in [9], with the difference that we define a probability measure on a subspace of ℓ^∞ rather than of L^∞. Let us first fix some notation. For x := {x_k}_{k∈N} ∈ ℓ^∞, we write ‖x‖_∞ = sup_{k∈N} |x_k|. Furthermore, let e_j ∈ ℓ^∞ for j ∈ N denote the standard unit vectors, i.e., [e_k]_j = 1 for j = k and 0 else. We then define

(4.1)    X := cl span{e_k}_{k∈N} ⊂ ℓ^∞

and note that X = {x ∈ ℓ^∞ : lim_{k→∞} x_k = 0} =: c_0. We thus have that (X, ‖·‖_∞) is a separable Banach space. We now construct a class of probability measures on X whose mass is concentrated on a set of sequences with strictly bounded components. First, we define a random variable ξ according to the random series

(4.2)    ξ := Σ_{k=1}^∞ γ_k ξ_k e_k,

where

(i) ξ_k ∼ U[−1, 1] (i.e., uniformly distributed on [−1, 1]) and

(ii) γ_k ≥ 0 for all k ∈ N with γ_k → 0.

Note that the partial sums ξ^n := Σ_{k=1}^n γ_k ξ_k e_k almost surely form a Cauchy sequence in X, since for all N, m, n ∈ N with N ≤ m ≤ n we have that

‖ξ^n − ξ^m‖_∞ = ‖Σ_{k=m+1}^n γ_k ξ_k e_k‖_∞ = sup_{m+1≤k≤n} γ_k |ξ_k| ≤ sup_{k≥N} γ_k,

and the right-hand side tends to zero as N → ∞. Since X is complete, the series (4.2) therefore converges almost surely. We can thus define the probability measure µ_γ on X by

(4.3)    µ_γ(A) := P[ξ ∈ A]    for every A ∈ B(X).
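A truncated draw from the series (4.2) can be simulated directly; the weight choice γ_k = 1/k and the truncation level are illustrative assumptions.

```python
import numpy as np

K = 1000                                       # truncation level (assumed)
rng = np.random.default_rng(0)
gamma = 1.0 / np.arange(1, K + 1)              # illustrative weights gamma_k = 1/k
xi = gamma * rng.uniform(-1.0, 1.0, size=K)    # coefficients gamma_k * xi_k

# Every sample lies in E_gamma (|xi_k| <= gamma_k), and its components decay
# to zero, consistent with X = c_0.
print(bool(np.all(np.abs(xi) <= gamma)))
print(float(np.max(np.abs(xi[500:]))))
```

Since the tail of the series is bounded by sup_{k≥N} γ_k, the truncation error is controlled uniformly, mirroring the Cauchy-sequence argument above.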
The following sets will be important for our study of the generalized and strong modes of µ_γ. We define

E_γ := {x ∈ X : |x_k| ≤ γ_k for all k ∈ N},

and for every δ > 0,

E^δ_γ := {x ∈ X : |x_k| ≤ max{γ_k − δ, 0} for all k ∈ N},

as well as

E^0_γ := ∪_{δ>0} E^δ_γ ⊆ E_γ.

We first collect some basic properties of E_γ and E^δ_γ.

Proposition 4.1. The sets E_γ and E^δ_γ are convex, compact, and have empty interior.

Proof. We only consider the case of E_γ; the case of E^δ_γ follows analogously. First, convexity follows directly from the definition since for every x, y ∈ E_γ and λ ∈ [0, 1] we have that

|{λx + (1 − λ)y}_k| = |λx_k + (1 − λ)y_k| ≤ λ|x_k| + (1 − λ)|y_k| ≤ γ_k    for all k ∈ N,

and hence that λx + (1 − λ)y ∈ E_γ. For the compactness of E_γ, we first show that X \ E_γ is open. Let x ∈ X \ E_γ be arbitrary. Then there exists an m ∈ N with |x_m| > γ_m, so that for ε := |x_m| − γ_m we have B_ε(x) ⊂ X \ E_γ as claimed. Hence, as a closed subset of a complete space, E_γ is itself complete, which allows us to show compactness by constructing a finite covering of E_γ by balls of radius ε. Let ε > 0. As γ_k → 0, there is an N ∈ N such that for every x ∈ E_γ and all k ≥ N + 1, |x_k| ≤ γ_k < ε. Now choose M ∈ N with Mε ≥ max{γ_k : k = 1, ..., N}. For every x ∈ E_γ and k ∈ {1, ..., N}, there is a λ_k ∈ {−M, ..., M} such that |x_k − λ_k ε| < ε. Hence, ‖x − z‖_∞ < ε for z := Σ_{k=1}^N λ_k ε e_k, which implies that the finitely many balls B_ε(z) with z = Σ_{k=1}^N λ_k ε e_k and λ_k ∈ {−M, ..., M} form the desired finite covering. Finally, to show that E_γ has empty interior, let x ∈ E_γ and ε > 0 be arbitrary. We choose m ∈ N such that γ_m < ε/3. Then, y := x + 3γ_m e_m ∈ B_ε(x), because ‖y − x‖_∞ = ‖3γ_m e_m‖_∞ = 3γ_m < ε, but y ∉ E_γ. Consequently, x ∉ int E_γ, and hence the interior of E_γ is empty.

Of particular use for finding both generalized and strong modes will be the metric projections onto E^δ_γ. For any δ > 0, let

P_δ : X → E^δ_γ,    P_δ(x) := x̄ with ‖x̄ − x‖_∞ = inf_{z∈E^δ_γ} ‖z − x‖_∞,
which is well-defined because E^δ_γ is closed and convex by Proposition 4.1. It is straightforward to show by case distinction that this projection can be characterized componentwise for every k ∈ N via

(4.4)    [P_δ(x)]_k = 0 if γ_k < δ;  γ_k − δ if x_k > γ_k − δ > 0;  x_k if x_k ∈ [−γ_k + δ, γ_k − δ];  −γ_k + δ if x_k < −γ_k + δ < 0.

The projection satisfies the following properties.

Lemma 4.2. Let x ∈ E_γ. Then, lim_{δ→0} P_δ x = x in X.

Proof. This follows directly from the definition since |{P_δ x}_k − x_k| ≤ δ for all k ∈ N.

Finally, we can give a more explicit characterization of E^0_γ.

Lemma 4.3. We have that

E^0_γ = {x ∈ X : |x_k| < γ_k for all k ∈ N, x_k ≠ 0 for only finitely many k ∈ N}.

Proof. By definition, for every x ∈ E^0_γ there exists a δ > 0 with x ∈ E^δ_γ. This implies that |x_k| ≤ max{γ_k − δ, 0} < γ_k for all k ∈ N. Moreover, since γ_k → 0 there exists an N ∈ N with γ_k ≤ δ for all k ≥ N, which yields |x_k| ≤ max{γ_k − δ, 0} = 0 for all k ≥ N. Now let x ∈ X be an element from the set on the right-hand side. As |x_k| < γ_k for all k ∈ N and because there are only finitely many k ∈ N with x_k ≠ 0, we can choose

δ_0 := min{γ_k − |x_k| : k ∈ N with x_k ≠ 0} > 0

such that x ∈ E^{δ_0}_γ ⊂ E^0_γ.
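The componentwise formula (4.4) amounts to clipping each component into the interval [−max{γ_k − δ, 0}, max{γ_k − δ, 0}]. A direct sketch for finitely many components (the weights and inputs below are illustrative):

```python
def project(x, gamma, delta):
    # metric projection (4.4) onto E_gamma^delta, via componentwise clipping
    out = []
    for xk, gk in zip(x, gamma):
        b = max(gk - delta, 0.0)        # componentwise bound of E_gamma^delta
        out.append(min(max(xk, -b), b))
    return out

gamma = [1.0, 0.75, 0.25, 0.1]
x = [0.95, -0.6, 0.1, 0.3]
print(project(x, gamma, 0.25))          # components with gamma_k <= delta go to 0
```

As δ → 0 the projection returns x itself for x ∈ E_γ, in line with Lemma 4.2.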
4.2 small ball probabilities

We next study the behavior of small balls under the probability measure µ_γ, which is crucial for determining modes. For the sake of conciseness, let

(4.5)    J^δ_γ(x) := µ_γ(B_δ(x))    for all x ∈ X,

for which we have the following straightforward characterization.

Lemma 4.4. For every x ∈ X and δ > 0, it holds that

µ_γ(cl B_δ(x)) = J^δ_γ(x) = Π_{k=1}^∞ P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)],

where cl A denotes the closure of a set A ⊂ X.

Proof. First of all, by definition

J^δ_γ(x) = µ_γ(B_δ(x)) = P[ξ ∈ B_δ(x)] = P[|γ_k ξ_k − x_k| < δ for all k ∈ N].

As both γ_k → 0 and x_k → 0, there exists an N ∈ N such that γ_k + |x_k| ≤ δ/2 for all k ≥ N + 1. This implies that

P[|γ_k ξ_k − x_k| < δ for all k ≥ N + 1] = 1,

since |ξ_k| ≤ 1 almost surely. Consequently,

µ_γ(B_δ(x)) = P[|γ_k ξ_k − x_k| < δ for k = 1, ..., N] · P[|γ_k ξ_k − x_k| < δ for all k ≥ N + 1] = Π_{k=1}^N P[|γ_k ξ_k − x_k| < δ]

by the independence of the ξ_k. This yields the second identity. The first identity now follows from

P[γ_k ξ_k ∈ [x_k − δ, x_k + δ]] = P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)]    for all k ∈ N.

We can use this characterization to show that for every δ > 0, the origin maximizes J^δ_γ.

Proposition 4.5. Let δ > 0. Then J^δ_γ(0) = max_{x∈X} J^δ_γ(x) > 0.

Proof. From the second equality in Lemma 4.4, we have that

µ_γ(B_δ(x)) = Π_{k=1}^∞ P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)] ≤ Π_{k=1}^∞ P[γ_k ξ_k ∈ (−δ, δ)] = µ_γ(B_δ(0))

for every x = {x_k}_{k∈N} ∈ X. On the other hand, µ_γ(B_δ(0)) ≤ sup_{x∈X} µ_γ(B_δ(x)), which shows that x = 0 is a maximizer. For the positivity, note that as γ_k → 0, there exists an N ∈ N such that

µ_γ(B_δ(0)) = Π_{k=1}^N P[γ_k ξ_k ∈ (−δ, δ)] > 0,

since P[γ_k ξ_k ∈ (−δ, δ)] > 0 for all k ∈ N.

Crucially, J^δ_γ is constant on E^δ_γ.

Proposition 4.6. For every δ > 0,

J^δ_γ(x_1) = J^δ_γ(x_2)    for all x_1, x_2 ∈ E^δ_γ.
Proof. Let x_1, x_2 ∈ E^δ_γ. For every k ∈ N, we distinguish between the following two cases:

(i) δ ≤ γ_k: In this case, (x_{1,k} − δ, x_{1,k} + δ) ⊂ (−γ_k, γ_k) and (x_{2,k} − δ, x_{2,k} + δ) ⊂ (−γ_k, γ_k), so that

P[γ_k ξ_k ∈ (x_{1,k} − δ, x_{1,k} + δ)] = δ/γ_k = P[γ_k ξ_k ∈ (x_{2,k} − δ, x_{2,k} + δ)].

(ii) δ > γ_k: In this case, x_{1,k} = x_{2,k} = 0 by definition of E^δ_γ, and hence

P[γ_k ξ_k ∈ (x_{1,k} − δ, x_{1,k} + δ)] = 1 = P[γ_k ξ_k ∈ (x_{2,k} − δ, x_{2,k} + δ)].

Together we obtain that

J^δ_γ(x_1) = P[ξ ∈ cl B_δ(x_1)] = Π_{k=1}^∞ P[γ_k ξ_k ∈ (x_{1,k} − δ, x_{1,k} + δ)] = Π_{k=1}^∞ P[γ_k ξ_k ∈ (x_{2,k} − δ, x_{2,k} + δ)] = P[ξ ∈ cl B_δ(x_2)] = J^δ_γ(x_2),

as claimed.

Since 0 ∈ E^δ_γ, combining Propositions 4.5 and 4.6 immediately yields that every x ∈ E^δ_γ maximizes J^δ_γ.

Corollary 4.7. For δ > 0 and every x̄ ∈ E^δ_γ,

J^δ_γ(x̄) = max_{x∈X} J^δ_γ(x) > 0.

4.3 modes of the probability measure

Finally, we characterize both the strong and the generalized modes of the uniform probability measure and show that these do not coincide.

Theorem 4.8. Every point x̂ ∈ E^0_γ is a strong mode of µ_γ.

Proof. By definition of E^0_γ, there is a δ̄ > 0 such that x̂ ∈ E^δ̄_γ. Then by Corollary 4.7,

M_δ = max_{x∈X} µ_γ(B_δ(x)) = µ_γ(B_δ(x̂))    for all δ ≤ δ̄,

and therefore

lim_{δ→0} µ_γ(B_δ(x̂)) / M_δ = lim_{δ→0} µ_γ(B_δ(x̂)) / µ_γ(B_δ(x̂)) = 1.
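The finite product from Lemma 4.4 is easy to evaluate for a truncated weight sequence, which also lets one check Propositions 4.5 and 4.6 numerically; the weights and ball radius below are illustrative.

```python
import math

def interval_prob(gk, lo, hi):
    # P[gamma_k xi_k in (lo, hi)] for xi_k ~ U[-1, 1], i.e. for a uniform
    # random variable on [-gamma_k, gamma_k]
    if gk == 0.0:
        return 1.0 if lo < 0.0 < hi else 0.0
    overlap = max(min(hi, gk) - max(lo, -gk), 0.0)
    return overlap / (2.0 * gk)

def J(x, gamma, delta):
    # J_gamma^delta(x) as the product over components (Lemma 4.4)
    return math.prod(interval_prob(g, xk - delta, xk + delta)
                     for xk, g in zip(x, gamma))

gamma, delta = [0.5, 0.25, 0.125], 0.1
origin = [0.0, 0.0, 0.0]
corner = [0.4, -0.15, 0.025]   # an extreme point of E_gamma^delta
print(J(origin, gamma, delta), J(corner, gamma, delta))
```

The two values agree, as Proposition 4.6 asserts, and the origin value bounds J at any other point, as in Proposition 4.5.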
However, the following result shows that Definition 2.1 is too restrictive in this case and in fact can be unintuitive.

Proposition 4.9. The following claims hold:

(i) If for x ∈ E_γ there is an m ∈ N with |x_m| = γ_m > 0, then x is not a strong mode of µ_γ.

(ii) There are γ, x ∈ X with |x_k| < γ_k for all k ∈ N such that x is not a strong mode of µ_γ.

Proof. Ad (i): First, note that for δ > 0 small enough,

P[γ_m ξ_m ∈ (x_m − δ, x_m + δ)] = ½ P[γ_m ξ_m ∈ (−δ, δ)].

Moreover, for all k ∈ N and δ > 0,

P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)] ≤ P[γ_k ξ_k ∈ (−δ, δ)],

so that

µ_γ(B_δ(x)) / µ_γ(B_δ(0)) = Π_{k=1}^∞ P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)] / P[γ_k ξ_k ∈ (−δ, δ)] ≤ ½

for δ > 0 small enough. But by Proposition 4.5, we also have that

lim sup_{δ→0} µ_γ(B_δ(x)) / M_δ = lim sup_{δ→0} µ_γ(B_δ(x)) / µ_γ(B_δ(0)) ≤ ½ < 1.

Hence, x cannot be a strong mode.

Ad (ii): We take γ_k = 1/k and x_k = ½γ_k = 1/(2k) for all k ∈ N. For given δ ∈ (0, ¼) we also choose n ∈ N such that 1/n ≥ δ > 1/(n+1), i.e., ½γ_n < δ ≤ γ_n. Hence,

P[γ_n ξ_n ∈ (x_n − δ, x_n + δ)] = (γ_n − (x_n − δ)) / (2γ_n) ≤ (γ_n − (½γ_n − γ_n)) / (2γ_n) = ¾,

as well as

P[γ_n ξ_n ∈ (−δ, δ)] = δ/γ_n > n/(n+1) ≥ 4/5    for n ≥ 4.

In addition,

P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)] ≤ P[γ_k ξ_k ∈ (−δ, δ)]    for all k ∈ N.

Consequently,

µ_γ(B_δ(x)) / µ_γ(B_δ(0)) = Π_{k=1}^∞ P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)] / P[γ_k ξ_k ∈ (−δ, δ)] ≤ P[γ_n ξ_n ∈ (x_n − δ, x_n + δ)] / P[γ_n ξ_n ∈ (−δ, δ)] ≤ 15/16.

Proposition 4.5 now yields that

lim sup_{δ→0} µ_γ(B_δ(x)) / M_δ = lim sup_{δ→0} µ_γ(B_δ(x)) / µ_γ(B_δ(0)) < 1,

and hence that x is not a strong mode.
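In a single coordinate, the halving in part (i) is easy to see numerically (pure Python; γ_m = 1 is an illustrative choice):

```python
def uniform_ball_mass(center, delta, g=1.0):
    # P[g * xi in (center - delta, center + delta)] for xi ~ U[-1, 1]
    lo, hi = max(center - delta, -g), min(center + delta, g)
    return max(hi - lo, 0.0) / (2.0 * g)

# Centering at the bound x_m = gamma_m = 1 captures exactly half the mass of
# centering at 0, for every small radius delta.
for delta in (0.1, 0.01, 0.001):
    print(uniform_ball_mass(1.0, delta) / uniform_ball_mass(0.0, delta))
```

The ratio stays at 1/2 no matter how small δ becomes, which is why such boundary points fail Definition 2.1 yet are still generalized modes by Theorem 4.10.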
On the other hand, every point in E_γ is a generalized mode and vice versa.

Theorem 4.10. A point x ∈ X is a generalized mode of µ_γ if and only if x ∈ E_γ.

Proof. Assume first that x ∈ E_γ and set w^δ := P^δ x ∈ E_γ^δ for all δ > 0. Then w^δ → x as δ → 0 by Lemma 4.2 and µ_γ(B_δ(w^δ)) = µ_γ(B_δ(0)) by Proposition 4.6, so that

µ_γ(B_δ(w^δ))/µ_γ(B_δ(0)) = 1 for all δ > 0.

Since M_δ := max_{x ∈ X} µ_γ(B_δ(x)) = µ_γ(B_δ(0)) by Proposition 4.5, it follows from Lemma 2.4 that x is a generalized mode.

Conversely, assume that x ∈ X \ E_γ. Then there exists an m ∈ ℕ with |x_m| > γ_m. Taking now δ_0 := |x_m| − γ_m > 0, we have that P[γ_m ξ_m ∈ (x_m − δ, x_m + δ)] = 0 for all δ ∈ (0, δ_0) and hence that

µ_γ(B_δ(x)) = ∏_{k=1}^∞ P[γ_k ξ_k ∈ (x_k − δ, x_k + δ)] = 0 for all δ ∈ (0, δ_0).

This implies that

µ_γ(B_δ(w))/M_δ = 0 for all δ ∈ (0, δ_0/2) and w ∈ B_{δ_0/2}(x),

and hence x cannot be a generalized mode.

5 variational characterization of generalized map estimates

In this section, we consider Bayesian inverse problems with uniform priors and characterize the corresponding generalized MAP estimates, i.e., the generalized modes of the posterior distribution, as minimizers of an appropriate objective functional. Let X be the separable Banach space defined by (4.1) and choose as prior µ_0 the uniform probability distribution µ_γ defined by (4.3) for some non-negative sequence of weights γ_k → 0. We assume that for given data y from a Banach space Y, the posterior distribution µ^y satisfies µ^y ≪ µ_0 and can be expressed as

(5.1) µ^y(A) = (1/Z(y)) ∫_A exp(−Φ(x; y)) µ_0(dx) for all A ∈ B(X),

where

Z(y) := ∫_X exp(−Φ(x; y)) µ_0(dx)

is a positive and finite normalization constant and Φ: X × Y → ℝ is the likelihood. This way, µ^y constitutes a probability measure on X. Throughout this section, we fix y ∈ Y and abbreviate Φ(x; y) by Φ(x). We make the following assumption on the likelihood.
Assumption 5.1. The function Φ: X → ℝ is Lipschitz continuous on bounded sets, i.e., for every r > 0 there exists L = L_r > 0 such that for all x_1, x_2 ∈ X with ‖x_1‖_X, ‖x_2‖_X ≤ r we have

|Φ(x_1) − Φ(x_2)| ≤ L ‖x_1 − x_2‖_X.

In Theorem 4.10, we have seen that the indicator function ι_{E_γ} of E_γ in the sense of convex analysis, i.e.,

ι_{E_γ}: X → R̄ := ℝ ∪ {∞}, ι_{E_γ}(x) := 0 if x ∈ E_γ and ι_{E_γ}(x) := ∞ otherwise,

is the suitable functional to minimize in order to find the generalized modes of the prior measure µ_0. In contrast, the common Onsager–Machlup functional is not defined for µ_0, as µ_0 is not quasi-invariant with respect to shifts along any direction x ∈ X \ {0}. The goal of this section is to relate a similar functional that characterizes generalized MAP estimates to a suitably generalized limit of small ball probabilities. Specifically, we define the functional

(5.2) I: X → R̄, I(x) := Φ(x) + ι_{E_γ}(x), i.e., I(x) = Φ(x) if x ∈ E_γ and I(x) = ∞ otherwise.

Furthermore, for δ > 0 we denote by

J^δ(x) := µ^y(B_δ(x)) for all x ∈ X

the small ball probabilities of the posterior measure. Let x_δ denote a (not necessarily unique) maximizer of J^δ in E_γ^δ, i.e.,

J^δ(x_δ) = max_{x ∈ E_γ^δ} J^δ(x).

What we prove below (in Theorem 5.9) is the following:

(i) {x_δ}_{δ>0} contains a convergent subsequence (because E_γ is compact), and the limit of any convergent subsequence minimizes I.

(ii) Any minimizer of I is a generalized MAP estimate and vice versa.

We first demonstrate the necessity of working with subsequences by an example which shows that the family {x_δ}_{δ>0} may not have a unique limit.

Example 5.2. Define g(0) := 0 and g(x) := n if (1/2)^{n+1} < |x| ≤ (1/2)^n, n ∈ ℤ, and consider a probability measure µ on ℝ with density f with respect to the Lebesgue measure, defined via

f̃(x) := max{1 − g(x − 1), 1 − 2g((x + 1)/2), 0}, f(x) := f̃(x) / ∫_ℝ f̃(x) dx.
Figure 1: probability density f constructed in Example 5.2 for which the family of maximizers {x_δ}_{δ>0} has two cluster points x̂_1 = −1 and x̂_2 = 1

To find the set of points x for which the probability of B_δ(x) is maximal under µ for given δ, first let n ∈ ℕ be such that (1/2)^{n+1} < δ ≤ (1/2)^n. By a case distinction, one can then verify that arg max_{x ∈ ℝ} µ(B_δ(x)) is an interval containing 1 for δ in one half of the dyadic interval ((1/2)^{n+1}, (1/2)^n] and an interval containing −1 for δ in the other half. Hence, any family {x_δ}_{δ>0} of maximizers x_δ ∈ arg max_{x ∈ ℝ} µ(B_δ(x)) has the two cluster points −1 and 1; see Figure 1.

We begin our analysis by collecting some auxiliary results on Φ and I.

Lemma 5.3. If Assumption 5.1 holds, then there exists a constant M > 0 such that

(5.3) |Φ(x)| ≤ M for all x ∈ E_γ.

Proof. Since E_γ is compact and therefore bounded by Proposition 4.1, there exists some R ≥ 0 such that ‖x‖ ≤ R for all x ∈ E_γ. The Lipschitz continuity of Φ on bounded sets then implies that for any x ∈ E_γ,

|Φ(x)| ≤ |Φ(x) − Φ(0)| + |Φ(0)| ≤ L ‖x‖ + |Φ(0)| ≤ L R + |Φ(0)| =: M.

Lemma 5.4. If Assumption 5.1 holds, then there exists an x̄ ∈ E_γ such that I(x̄) = min_{x ∈ X} I(x).

Proof. As I(x) = ∞ for all x ∈ X \ E_γ, we only need to consider I|_{E_γ} = Φ|_{E_γ}, which is Lipschitz continuous on the bounded set E_γ by Assumption 5.1. Furthermore, the set E_γ is nonempty and by Proposition 4.1 closed and compact. The Weierstrass Theorem therefore implies that I attains its minimum in a point x̄ ∈ E_γ.

We also need the following results on the small ball probabilities of the posterior.
Lemma 5.5. Assume that Assumption 5.1 holds. Then for every δ > 0 there exists a constant c(δ) > 0 such that

J^δ(x) ≥ c(δ) > 0 for all x ∈ E_γ^δ.

Proof. First, µ_0(B_δ(x)) = µ_0(B_δ(0)) > 0 for all x ∈ E_γ^δ by Proposition 4.6 and Corollary 4.7. Hence we can estimate using Lemma 5.3 that

J^δ(x) = (1/Z(y)) ∫_{B_δ(x)} exp(−Φ(x̃)) µ_0(dx̃) ≥ (1/Z(y)) e^{−M} µ_0(B_δ(0)) =: c(δ) > 0,

because Z(y) is positive and finite by assumption.

We next show that J^δ is maximized within E_γ^δ by making use of the metric projection P^δ: X → E_γ^δ defined in (4.4).

Lemma 5.6. For all δ > 0, we have that

(5.4) J^δ(P^δ x) ≥ J^δ(x) for all x ∈ X.

Proof. We note that B_δ(x) \ B̄_δ(P^δ x) ⊂ X \ E_γ, so that µ_0(B_δ(x) \ B̄_δ(P^δ x)) = 0. Moreover, µ_0(B̄_δ(P^δ x) \ B_δ(P^δ x)) = 0 by Lemma 4.4, so that µ_0(B_δ(x) \ B_δ(P^δ x)) = 0 as well. With this knowledge we estimate

J^δ(x) = (1/Z(y)) ∫_{B_δ(x)} exp(−Φ(x̃)) µ_0(dx̃) = (1/Z(y)) ∫_{B_δ(x) ∩ B_δ(P^δ x)} exp(−Φ(x̃)) µ_0(dx̃) ≤ (1/Z(y)) ∫_{B_δ(P^δ x)} exp(−Φ(x̃)) µ_0(dx̃) = J^δ(P^δ x).

Corollary 5.7. For all δ > 0,

J^δ(x_δ) := max_{x ∈ E_γ^δ} J^δ(x) = max_{x ∈ X} J^δ(x).

Proof. If x ∈ X is a maximizer of J^δ, then P^δ x ∈ E_γ^δ is a maximizer as well.

The following proposition indicates that the functional I plays the role of a generalized Onsager–Machlup functional for the posterior distribution µ^y associated with the uniform prior µ_0 defined by (4.3).

Proposition 5.8. Suppose that Assumption 5.1 holds. Let x_1, x_2 ∈ E_γ and {w_1^δ}_{δ>0}, {w_2^δ}_{δ>0} ⊂ E_γ with

(i) w_1^δ, w_2^δ ∈ E_γ^δ for all δ > 0,

(ii) w_1^δ → x_1 and w_2^δ → x_2 as δ → 0.
Then,

(5.5) lim_{δ→0} J^δ(w_1^δ)/J^δ(w_2^δ) = exp(I(x_2) − I(x_1)).

Proof. Let δ > 0. We consider

J^δ(w_1^δ)/J^δ(w_2^δ) = ∫_{B_δ(w_1^δ)} exp(−Φ(x)) µ_0(dx) / ∫_{B_δ(w_2^δ)} exp(−Φ(x)) µ_0(dx)
= exp(Φ(x_2) − Φ(x_1)) · ∫_{B_δ(w_1^δ)} exp(Φ(x_1) − Φ(x)) µ_0(dx) / ∫_{B_δ(w_2^δ)} exp(Φ(x_2) − Φ(x)) µ_0(dx),

which is well-defined by Lemma 5.5. By Assumption 5.1,

|Φ(x_1) − Φ(x)| ≤ L ‖x_1 − x‖ ≤ L (‖x_1 − w_1^δ‖ + ‖w_1^δ − x‖) ≤ L (‖x_1 − w_1^δ‖ + δ) for all x ∈ B_δ(w_1^δ)

and

|Φ(x_2) − Φ(x)| ≤ L ‖x_2 − x‖ ≤ L (‖x_2 − w_2^δ‖ + δ) for all x ∈ B_δ(w_2^δ),

where L is the Lipschitz constant of Φ on the bounded set ∪_{x ∈ E_γ} B_δ(x). It follows that

exp(Φ(x_2) − Φ(x_1)) e^{−L(‖x_1 − w_1^δ‖ + ‖x_2 − w_2^δ‖ + 2δ)} µ_0(B_δ(w_1^δ))/µ_0(B_δ(w_2^δ)) ≤ J^δ(w_1^δ)/J^δ(w_2^δ) ≤ exp(Φ(x_2) − Φ(x_1)) e^{L(‖x_1 − w_1^δ‖ + ‖x_2 − w_2^δ‖ + 2δ)} µ_0(B_δ(w_1^δ))/µ_0(B_δ(w_2^δ)).

Now Proposition 4.6 implies that µ_0(B_δ(w_1^δ)) = µ_0(B_δ(w_2^δ)) for all δ > 0, and hence taking the limit δ → 0 and using (ii) together with the fact that I(x) = Φ(x) for all x ∈ E_γ yields the claim.

Note that Lemma 4.2 immediately implies that (5.5) also holds true for the prior distribution µ_0 if I is replaced by ι_{E_γ}. However, it is an open question how Proposition 5.8 can be generalized to cover a wider class of distributions while sustaining the connection to generalized MAP estimates described in the following theorem, which is the main result of this section.

Theorem 5.9. Suppose Assumption 5.1 holds. Then the following hold:

(i) For every {δ_n}_{n∈ℕ} ⊂ (0, ∞) with δ_n → 0, the sequence {x_{δ_n}}_{n∈ℕ} contains a subsequence that converges strongly in X to some x̄ ∈ E_γ.

(ii) Any cluster point x̄ ∈ E_γ of {x_{δ_n}}_{n∈ℕ} is a minimizer of I.
(iii) A point x̂ ∈ X is a minimizer of I if and only if it is a generalized MAP estimate for µ^y.

Proof. Ad (i): By definition, x_δ ∈ E_γ^δ ⊂ E_γ for all δ > 0. Since E_γ is compact and closed by Proposition 4.1, there exists a convergent subsequence, again denoted by {x_{δ_n}}_{n∈ℕ}, with limit x̄ ∈ E_γ.

Ad (ii): Let now x̄ ∈ E_γ be the limit of an arbitrary convergent subsequence, still denoted by {x_{δ_n}}_{n∈ℕ}, and assume that x̄ is not a minimizer of I. By Lemma 5.4, a minimizer x* ∈ E_γ of I exists and by assumption satisfies I(x̄) > I(x*). Moreover, P^δ x* → x* as δ → 0 by Lemma 4.2. Now the definition of x_δ and Proposition 5.8 yield

1 ≤ lim_{n→∞} µ^y(B_{δ_n}(x_{δ_n}))/µ^y(B_{δ_n}(P^{δ_n} x*)) = exp(I(x*) − I(x̄)) < 1,

a contradiction. So x̄ is in fact a minimizer of I.

Ad (iii): Now let x* ∈ X be a minimizer of I, which by definition satisfies x* ∈ E_γ. To see that x* is a generalized MAP estimate, we choose w^δ := P^δ x* ∈ E_γ^δ for every δ > 0, which implies that w^δ → x* as δ → 0. Furthermore, by definition of x_δ and M_δ,

µ^y(B_δ(x_δ)) = max_{x ∈ X} µ^y(B_δ(x)) = M_δ for all δ > 0.

By Assumption 5.1 and the boundedness of ∪_{x ∈ E_γ} B_δ(x), we can use the Lipschitz continuity of Φ to obtain the estimate

Φ(x) ≤ Φ(x*) + L (‖x − w^δ‖ + ‖w^δ − x*‖) ≤ Φ(x*) + 2Lδ for all x ∈ B_δ(w^δ).

On the other hand, a similar argument shows that

Φ(x) ≥ Φ(x_δ) − L ‖x − x_δ‖ ≥ Φ(x_δ) − Lδ for all x ∈ B_δ(x_δ).

Consequently,

µ^y(B_δ(w^δ)) = (1/Z(y)) ∫_{B_δ(w^δ)} exp(−Φ(x)) µ_0(dx) ≥ exp(−Φ(x*) − 2Lδ) (1/Z(y)) µ_0(B_δ(w^δ))

and

M_δ = µ^y(B_δ(x_δ)) = (1/Z(y)) ∫_{B_δ(x_δ)} exp(−Φ(x)) µ_0(dx) ≤ exp(−Φ(x_δ) + Lδ) (1/Z(y)) µ_0(B_δ(x_δ)).

As µ_0(B_δ(w^δ)) = µ_0(B_δ(x_δ)) by Proposition 4.6, combining these two estimates yields

µ^y(B_δ(w^δ))/µ^y(B_δ(x_δ)) ≥ exp(Φ(x_δ) − Φ(x*) − 3Lδ).

Now the minimizing property of x* implies that

liminf_{δ→0} µ^y(B_δ(w^δ))/µ^y(B_δ(x_δ)) ≥ lim_{δ→0} exp(−3Lδ) = 1
and hence that x* is a generalized MAP estimate.

Conversely, let x̂ be a generalized MAP estimate and {δ_n} ⊂ (0, ∞) with δ_n → 0 and corresponding qualifying sequence {w_n}_{n∈ℕ} ⊂ X, i.e., w_n → x̂ and

lim_{n→∞} µ^y(B_{δ_n}(w_n))/M_{δ_n} = 1.

From Lemma 4.2, it follows that P^{δ_n} w_n → x̂ as well. We then have x̂ ∈ E_γ, because otherwise the closedness of E_γ by Proposition 4.1 would imply that µ^y(B_{δ_n}(w_n)) = 0 for n large enough. Also, by definition of x_δ,

µ^y(B_δ(x_δ)) = max_{x ∈ X} µ^y(B_δ(x)) = M_δ for all δ > 0.

Now by (i) and (ii), we may extract a subsequence, again denoted by {x_{δ_n}}_{n∈ℕ}, such that x_{δ_n} → x̄ ∈ E_γ and x̄ is a minimizer of I. Proposition 5.8 and Lemma 5.6 then yield that

exp(I(x̄) − I(x̂)) = lim_{n→∞} µ^y(B_{δ_n}(P^{δ_n} w_n))/µ^y(B_{δ_n}(x_{δ_n})) ≥ lim_{n→∞} µ^y(B_{δ_n}(w_n))/µ^y(B_{δ_n}(x_{δ_n})) = 1.

Hence, I(x̄) ≥ I(x̂), i.e., x̂ is a minimizer of I.

Remark 5.10. Theorem 5.9 shows that for inverse problems subject to Gaussian noise as in the following section, the generalized MAP estimate coincides with Ivanov regularization with the specific choice of the compact set E_γ; compare (6.1) below with, e.g., [14, 23, 28]. Hence, Ivanov regularization can be considered as a non-parametric MAP estimate for a suitable choice of the compact set the solution is restricted to. Furthermore, we point out that for linear inverse problems where the Ivanov functional is convex, the minimizers generically lie on the boundary of the compact set; see, e.g., [28, Prop. 2.2 (iii)] and [6, Cor. 2.6]. However, this is not possible for strong modes due to Proposition 4.9 (i), which further indicates the need to consider generalized modes in this setting.

6 consistency in nonlinear inverse problems with gaussian noise

Finally, we show consistency in the small noise limit of the generalized MAP estimate from Section 5 for the special case of finite-dimensional data and additive Gaussian noise. Let Y = ℝ^K for a fixed K ∈ ℕ, endowed with the Euclidean norm, and let F: X → Y be a closed nonlinear operator.
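The identification of generalized MAP estimates with Ivanov regularization can be illustrated on a finite truncation before turning to the consistency analysis. The following sketch is purely illustrative: the linear forward map A, the weights γ_k = 1/k, and the step size are hypothetical choices, not taken from the paper. It minimizes the Gaussian misfit over the box E_γ by projected gradient descent; for the sup-norm box, the metric projection onto E_γ is coordinate-wise clipping.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 10                                   # truncation level, data dimension
gamma = 1.0 / np.arange(1, N + 1)               # assumed weights gamma_k = 1/k
A = rng.standard_normal((K, N)) / np.sqrt(N)    # hypothetical linear forward map
x_true = 0.5 * gamma                            # ground truth in the interior of E_gamma
delta = 1e-3                                    # noise level
y = A @ x_true + delta * rng.standard_normal(K)

# Ivanov regularization: minimize ||A x - y||^2 / 2 subject to |x_k| <= gamma_k
x = np.zeros(N)
step = 1.0 / np.linalg.norm(A, 2) ** 2          # step 1/L for the smooth misfit
for _ in range(20000):
    x = x - step * A.T @ (A @ x - y)            # gradient step on the misfit
    x = np.clip(x, -gamma, gamma)               # projection onto the box E_gamma
```

The iterate stays feasible throughout, and the residual shrinks toward the noise level; with Σ = I, dropping the 1/(2δ²) scaling does not change the minimizer, so this is the constrained misfit minimization that the next section analyzes.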
Assuming x ∼ µ_0 for the prior measure µ_0 defined via (4.3) and that y ∈ Y is a corresponding measurement of F(x) corrupted by additive Gaussian noise with mean 0 and positive definite covariance operator Σ ∈ ℝ^{K×K} scaled by δ > 0, the corresponding posterior measure µ^y is given by

dµ^y/dµ_0 (x) = (1/Z(y)) exp(−Φ(x; y)) for µ_0-almost all x ∈ X,

where

Φ(x; y) := (1/(2δ²)) ‖Σ^{−1/2}(F(x) − y)‖²_Y
for all x ∈ X and y ∈ Y. Furthermore, we assume that for every R > 0, the restriction of F to B_R(0) is Lipschitz continuous, so that Assumption 5.1 is satisfied. Then Theorem 5.9 yields that the generalized MAP estimates for µ^y are given by the minimizers of the functional

(6.1) I(x) = (1/(2δ²)) ‖Σ^{−1/2}(F(x) − y)‖²_Y + ι_{E_γ}(x).

Now let {δ_n}_{n∈ℕ} ⊂ (0, ∞) with δ_n → 0 and consider a frequentist setup where we have a true solution x† ∈ E_γ and a sequence {y_n}_{n∈ℕ} ⊂ Y of measurements given by

y_n = F(x†) + δ_n η_n,

where η_n ∼ N(0, Σ) are independently and identically distributed. Moreover, define

(6.2) I_n(x) := (1/(2δ_n²)) ‖Σ^{−1/2}(F(x) − y_n)‖²_Y + ι_{E_γ}(x) for all n ∈ ℕ and x ∈ X.

Let x_n ∈ E_γ denote a minimizer of I_n for all n ∈ ℕ. Then we have the following consistency result.

Theorem 6.1. Suppose that F: X → Y is closed and x† ∈ E_γ. Then any sequence {x_n}_{n∈ℕ} of minimizers of (6.2) contains a convergent subsequence whose limit x̄ ∈ E_γ satisfies F(x̄) = F(x†) almost surely.

Proof. Let e_k(x) := x_k for all x ∈ X and k ∈ ℕ. As x_n ∈ E_γ for all n ∈ ℕ, the sequence {x_n}_{n∈ℕ} is almost surely bounded with |⟨e_k, x_n⟩_X| = |[x_n]_k| ≤ γ_k almost surely for all k ∈ ℕ. We can thus extract for every k ∈ ℕ a subsequence with E[⟨e_k, x_n⟩_X] → x̄_k and |x̄_k| ≤ γ_k. Here and in the following, we pass to the subsequence without adapting the notation. By a diagonal sequence argument, we can thus construct a sequence, again denoted by {x_n}_{n∈ℕ}, satisfying E[⟨e_k, x_n⟩_X] → x̄_k for all k ∈ ℕ and x̄ := Σ_{k∈ℕ} x̄_k e_k ∈ E_γ. We now write for every v ∈ X* ≅ ℓ¹ and N ∈ ℕ

E[|⟨v, x_n − x̄⟩_X|] = E[|Σ_{k=1}^∞ ⟨v, e_k⟩_X ⟨e_k, x_n − x̄⟩_X|]
≤ Σ_{k=1}^N |⟨v, e_k⟩_X| E[|⟨e_k, x_n − x̄⟩_X|] + Σ_{k=N+1}^∞ |⟨v, e_k⟩_X| 2γ_k
≤ ‖v‖_{X*} sup_{k ∈ {1,…,N}} E[|⟨e_k, x_n − x̄⟩_X|] + 2 ‖v‖_{X*} sup_{k ≥ N+1} γ_k.

Since γ_k → 0, we can for every ε > 0 choose N large enough such that

2 ‖v‖_{X*} sup_{k ≥ N+1} γ_k ≤ ε/2,
and then choose n_0 large enough such that also

‖v‖_{X*} E[|⟨e_k, x_n − x̄⟩_X|] ≤ ε/2 for all n ≥ n_0 and k ∈ {1,…,N},

since E[⟨e_k, x_n⟩_X] → ⟨e_k, x̄⟩_X. Consequently,

E[|⟨v, x_n − x̄⟩_X|] ≤ ε for all n ≥ n_0,

i.e., E[|⟨v, x_n − x̄⟩_X|] → 0. From this we conclude that ⟨v, x_n − x̄⟩_X → 0 in probability as n → ∞ for all v ∈ X*. This implies the existence of a subsequence that converges weakly to x̄ almost surely. Therefore, by compactness of E_γ, a further subsequence converges strongly to x̄ almost surely.

Furthermore, inserting the definition of y_n yields that

I_n(x) = (1/(2δ_n²)) ‖Σ^{−1/2}(F(x†) − F(x) + δ_n η_n)‖²_Y
= (1/(2δ_n²)) ‖Σ^{−1/2}(F(x†) − F(x))‖²_Y + (1/δ_n) ⟨Σ^{−1/2}(F(x†) − F(x)), Σ^{−1/2} η_n⟩_Y + (1/2) ‖Σ^{−1/2} η_n‖²_Y

for all x ∈ E_γ. Note that the first two terms vanish for x = x† ∈ E_γ and that I_n(x_n) ≤ I_n(x†) by definition of x_n. Rearranging this inequality and using the Cauchy–Schwarz and Young inequalities, we thus obtain that

‖Σ^{−1/2}(F(x†) − F(x_n))‖²_Y ≤ −2δ_n ⟨Σ^{−1/2}(F(x†) − F(x_n)), Σ^{−1/2} η_n⟩_Y
≤ 2δ_n ‖Σ^{−1/2}(F(x†) − F(x_n))‖_Y ‖Σ^{−1/2} η_n‖_Y
≤ (1/2) ‖Σ^{−1/2}(F(x†) − F(x_n))‖²_Y + 2δ_n² ‖Σ^{−1/2} η_n‖²_Y.

Consequently,

‖Σ^{−1/2}(F(x_n) − F(x†))‖²_Y ≤ 4δ_n² ‖Σ^{−1/2} η_n‖²_Y,

and hence

E[‖Σ^{−1/2}(F(x_n) − F(x†))‖²_Y] ≤ 4δ_n² E[‖Σ^{−1/2} η_n‖²_Y] = 4Kδ_n² → 0,

which also implies that ‖Σ^{−1/2}(F(x_n) − F(x†))‖_Y → 0 in probability. After passing to a further subsequence, we thus obtain that F(x_n) → F(x†) almost surely. The closedness of F then yields that F(x̄) = F(x†) almost surely.

Note that although Theorem 6.1 does not require the Lipschitz continuity of F on bounded sets, in general without it we do not know whether the minimizers of I_n exist or whether they are generalized MAP estimates for µ^y. By a standard subsequence-subsequence argument, we can obtain convergence of the full sequence under the usual assumption on F.
Corollary 6.2. If F is injective, then x_n → x† in probability as n → ∞.

Proof. We can apply the proof of Theorem 6.1 to every subsequence of {x_n}_{n∈ℕ} to obtain a further subsequence that converges to x̄ = x† almost surely. This in turn is equivalent to the convergence of the whole sequence to x† in probability by [18, Cor. 6.13].

Remark 6.3. Consistency in the large sample size limit can be shown analogously. Specifically, assuming a fixed noise level δ > 0 (which without loss of generality we can fix at δ = 1) and n independent measurements y_1, …, y_n ∈ Y, the posterior measure µ_n on X has the density

dµ_n/dµ_0 (x) := (∏_{j=1}^n Z(y_j))^{−1} exp(−(1/2) Σ_{j=1}^n ‖Σ^{−1/2}(F(x) − y_j)‖²_Y) for µ_0-almost all x ∈ X.

Again we consider a frequentist setup with a true solution x† ∈ E_γ and

y_j = F(x†) + η_j for all j ∈ ℕ.

Then it can be shown as in Theorem 6.1 that any sequence of minimizers x_n ∈ E_γ of

I_n(x) := (1/2) Σ_{j=1}^n ‖Σ^{−1/2}(F(x) − y_j)‖²_Y + ι_{E_γ}(x)

contains a convergent subsequence whose limit x̄ ∈ E_γ satisfies F(x̄) = F(x†) almost surely and that the full sequence converges to x† in probability if F is injective.

7 conclusion

We have proposed a novel definition of generalized modes and corresponding MAP estimates for non-parametric Bayesian statistics. Our approach extends the construction of [8] by replacing the fixed base point with a qualifying sequence in the convergence of small ball probabilities. This allows covering cases where the previous approach fails, such as that of priors having discontinuous densities. Our definition coincides with the definition from [8] for Gaussian priors as well as for priors which admit a qualifying sequence satisfying additional convergence properties. For uniform priors defined via a random series with bounded coefficients, such generalized modes can be shown to exist.
Furthermore, the corresponding generalized MAP estimates in Bayesian inverse problems (i.e., generalized modes of the posterior distribution) can be characterized as minimizers of a functional that can be seen as a generalization of the Onsager–Machlup functional in the Gaussian case. For inverse problems with finite-dimensional Gaussian noise, the generalized MAP estimates are consistent in the small noise as well as in the large sample limit.

This work can be extended in several directions. The first question is whether qualifying sequences can be constructed from MAP estimates for a sequence of finite-dimensional problems, i.e., whether our generalized MAP estimates arise as Γ-limits of finite-dimensional MAP estimates. Also, it would be of interest to study Gaussian priors with non-negativity constraints, where
More information16 1 Basic Facts from Functional Analysis and Banach Lattices
16 1 Basic Facts from Functional Analysis and Banach Lattices 1.2.3 Banach Steinhaus Theorem Another fundamental theorem of functional analysis is the Banach Steinhaus theorem, or the Uniform Boundedness
More information2) Let X be a compact space. Prove that the space C(X) of continuous real-valued functions is a complete metric space.
University of Bergen General Functional Analysis Problems with solutions 6 ) Prove that is unique in any normed space. Solution of ) Let us suppose that there are 2 zeros and 2. Then = + 2 = 2 + = 2. 2)
More informationFunctional Analysis F3/F4/NVP (2005) Homework assignment 3
Functional Analysis F3/F4/NVP (005 Homework assignment 3 All students should solve the following problems: 1. Section 4.8: Problem 8.. Section 4.9: Problem 4. 3. Let T : l l be the operator defined by
More informationCourse 212: Academic Year Section 1: Metric Spaces
Course 212: Academic Year 1991-2 Section 1: Metric Spaces D. R. Wilkins Contents 1 Metric Spaces 3 1.1 Distance Functions and Metric Spaces............. 3 1.2 Convergence and Continuity in Metric Spaces.........
More information2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.
Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following
More information3 Integration and Expectation
3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ
More informationMeasure and Integration: Solutions of CW2
Measure and Integration: s of CW2 Fall 206 [G. Holzegel] December 9, 206 Problem of Sheet 5 a) Left (f n ) and (g n ) be sequences of integrable functions with f n (x) f (x) and g n (x) g (x) for almost
More informationON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS
Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)
More information4 Hilbert spaces. The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan
The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan Wir müssen wissen, wir werden wissen. David Hilbert We now continue to study a special class of Banach spaces,
More informationTopology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA address:
Topology Xiaolong Han Department of Mathematics, California State University, Northridge, CA 91330, USA E-mail address: Xiaolong.Han@csun.edu Remark. You are entitled to a reward of 1 point toward a homework
More informationAnalysis Comprehensive Exam Questions Fall 2008
Analysis Comprehensive xam Questions Fall 28. (a) Let R be measurable with finite Lebesgue measure. Suppose that {f n } n N is a bounded sequence in L 2 () and there exists a function f such that f n (x)
More information08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms
(February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops
More information1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),
Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer
More informationImmerse Metric Space Homework
Immerse Metric Space Homework (Exercises -2). In R n, define d(x, y) = x y +... + x n y n. Show that d is a metric that induces the usual topology. Sketch the basis elements when n = 2. Solution: Steps
More informationEconomics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011
Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Section 2.6 (cont.) Properties of Real Functions Here we first study properties of functions from R to R, making use of the additional structure
More informationProblem 1: Compactness (12 points, 2 points each)
Final exam Selected Solutions APPM 5440 Fall 2014 Applied Analysis Date: Tuesday, Dec. 15 2014, 10:30 AM to 1 PM You may assume all vector spaces are over the real field unless otherwise specified. Your
More informationANALYSIS QUALIFYING EXAM FALL 2017: SOLUTIONS. 1 cos(nx) lim. n 2 x 2. g n (x) = 1 cos(nx) n 2 x 2. x 2.
ANALYSIS QUALIFYING EXAM FALL 27: SOLUTIONS Problem. Determine, with justification, the it cos(nx) n 2 x 2 dx. Solution. For an integer n >, define g n : (, ) R by Also define g : (, ) R by g(x) = g n
More informationLocally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem
56 Chapter 7 Locally convex spaces, the hyperplane separation theorem, and the Krein-Milman theorem Recall that C(X) is not a normed linear space when X is not compact. On the other hand we could use semi
More informationChapter 3: Baire category and open mapping theorems
MA3421 2016 17 Chapter 3: Baire category and open mapping theorems A number of the major results rely on completeness via the Baire category theorem. 3.1 The Baire category theorem 3.1.1 Definition. A
More information7 Complete metric spaces and function spaces
7 Complete metric spaces and function spaces 7.1 Completeness Let (X, d) be a metric space. Definition 7.1. A sequence (x n ) n N in X is a Cauchy sequence if for any ɛ > 0, there is N N such that n, m
More information( f ^ M _ M 0 )dµ (5.1)
47 5. LEBESGUE INTEGRAL: GENERAL CASE Although the Lebesgue integral defined in the previous chapter is in many ways much better behaved than the Riemann integral, it shares its restriction to bounded
More informationIf Y and Y 0 satisfy (1-2), then Y = Y 0 a.s.
20 6. CONDITIONAL EXPECTATION Having discussed at length the limit theory for sums of independent random variables we will now move on to deal with dependent random variables. An important tool in this
More informationWeek 5 Lectures 13-15
Week 5 Lectures 13-15 Lecture 13 Definition 29 Let Y be a subset X. A subset A Y is open in Y if there exists an open set U in X such that A = U Y. It is not difficult to show that the collection of all
More informationExamples of Dual Spaces from Measure Theory
Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an
More informationL p Spaces and Convexity
L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function
More informationNormed and Banach Spaces
(August 30, 2005) Normed and Banach Spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ We have seen that many interesting spaces of functions have natural structures of Banach spaces:
More informationANALYSIS QUALIFYING EXAM FALL 2016: SOLUTIONS. = lim. F n
ANALYSIS QUALIFYING EXAM FALL 206: SOLUTIONS Problem. Let m be Lebesgue measure on R. For a subset E R and r (0, ), define E r = { x R: dist(x, E) < r}. Let E R be compact. Prove that m(e) = lim m(e /n).
More informationFrom now on, we will represent a metric space with (X, d). Here are some examples: i=1 (x i y i ) p ) 1 p, p 1.
Chapter 1 Metric spaces 1.1 Metric and convergence We will begin with some basic concepts. Definition 1.1. (Metric space) Metric space is a set X, with a metric satisfying: 1. d(x, y) 0, d(x, y) = 0 x
More informationFUNDAMENTALS OF REAL ANALYSIS by. II.1. Prelude. Recall that the Riemann integral of a real-valued function f on an interval [a, b] is defined as
FUNDAMENTALS OF REAL ANALYSIS by Doğan Çömez II. MEASURES AND MEASURE SPACES II.1. Prelude Recall that the Riemann integral of a real-valued function f on an interval [a, b] is defined as b n f(xdx :=
More informationII - REAL ANALYSIS. This property gives us a way to extend the notion of content to finite unions of rectangles: we define
1 Measures 1.1 Jordan content in R N II - REAL ANALYSIS Let I be an interval in R. Then its 1-content is defined as c 1 (I) := b a if I is bounded with endpoints a, b. If I is unbounded, we define c 1
More informationThe small ball property in Banach spaces (quantitative results)
The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence
More informationA LITTLE REAL ANALYSIS AND TOPOLOGY
A LITTLE REAL ANALYSIS AND TOPOLOGY 1. NOTATION Before we begin some notational definitions are useful. (1) Z = {, 3, 2, 1, 0, 1, 2, 3, }is the set of integers. (2) Q = { a b : aεz, bεz {0}} is the set
More informationMeasures. Chapter Some prerequisites. 1.2 Introduction
Lecture notes Course Analysis for PhD students Uppsala University, Spring 2018 Rostyslav Kozhan Chapter 1 Measures 1.1 Some prerequisites I will follow closely the textbook Real analysis: Modern Techniques
More informationExercise 1. Let f be a nonnegative measurable function. Show that. where ϕ is taken over all simple functions with ϕ f. k 1.
Real Variables, Fall 2014 Problem set 3 Solution suggestions xercise 1. Let f be a nonnegative measurable function. Show that f = sup ϕ, where ϕ is taken over all simple functions with ϕ f. For each n
More informationTools from Lebesgue integration
Tools from Lebesgue integration E.P. van den Ban Fall 2005 Introduction In these notes we describe some of the basic tools from the theory of Lebesgue integration. Definitions and results will be given
More informationB. Appendix B. Topological vector spaces
B.1 B. Appendix B. Topological vector spaces B.1. Fréchet spaces. In this appendix we go through the definition of Fréchet spaces and their inductive limits, such as they are used for definitions of function
More informationMathematics for Economists
Mathematics for Economists Victor Filipe Sao Paulo School of Economics FGV Metric Spaces: Basic Definitions Victor Filipe (EESP/FGV) Mathematics for Economists Jan.-Feb. 2017 1 / 34 Definitions and Examples
More informationPARTIAL REGULARITY OF BRENIER SOLUTIONS OF THE MONGE-AMPÈRE EQUATION
PARTIAL REGULARITY OF BRENIER SOLUTIONS OF THE MONGE-AMPÈRE EQUATION ALESSIO FIGALLI AND YOUNG-HEON KIM Abstract. Given Ω, Λ R n two bounded open sets, and f and g two probability densities concentrated
More informationThe Heine-Borel and Arzela-Ascoli Theorems
The Heine-Borel and Arzela-Ascoli Theorems David Jekel October 29, 2016 This paper explains two important results about compactness, the Heine- Borel theorem and the Arzela-Ascoli theorem. We prove them
More information01. Review of metric spaces and point-set topology. 1. Euclidean spaces
(October 3, 017) 01. Review of metric spaces and point-set topology Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 017-18/01
More informationEmpirical Processes: General Weak Convergence Theory
Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated
More informationContinuity. Matt Rosenzweig
Continuity Matt Rosenzweig Contents 1 Continuity 1 1.1 Rudin Chapter 4 Exercises........................................ 1 1.1.1 Exercise 1............................................. 1 1.1.2 Exercise
More informationExtreme points of compact convex sets
Extreme points of compact convex sets In this chapter, we are going to show that compact convex sets are determined by a proper subset, the set of its extreme points. Let us start with the main definition.
More informationSpectral theory for compact operators on Banach spaces
68 Chapter 9 Spectral theory for compact operators on Banach spaces Recall that a subset S of a metric space X is precompact if its closure is compact, or equivalently every sequence contains a Cauchy
More information2 Sequences, Continuity, and Limits
2 Sequences, Continuity, and Limits In this chapter, we introduce the fundamental notions of continuity and limit of a real-valued function of two variables. As in ACICARA, the definitions as well as proofs
More informationJUHA KINNUNEN. Harmonic Analysis
JUHA KINNUNEN Harmonic Analysis Department of Mathematics and Systems Analysis, Aalto University 27 Contents Calderón-Zygmund decomposition. Dyadic subcubes of a cube.........................2 Dyadic cubes
More information