arxiv: v2 [math.na] 16 May 2014

A GLOBAL MINIMIZATION ALGORITHM FOR TIKHONOV FUNCTIONALS WITH SPARSITY CONSTRAINTS

WEI WANG, STEPHAN W. ANZENGRUBER, RONNY RAMLAU, AND BO HAN

arXiv:1401.0435v2 [math.NA] 16 May 2014

Abstract. In this paper we present a globally convergent algorithm for the computation of a minimizer of the Tikhonov functional with sparsity promoting penalty term for nonlinear forward operators in Banach space. The dual TIGRA method uses a gradient descent iteration in the dual space at decreasing values of the regularization parameter $\alpha_j$, where the approximation obtained with $\alpha_j$ serves as the starting value for the dual iteration with parameter $\alpha_{j+1}$. With the discrepancy principle as a global stopping rule, the method further yields an automatic parameter choice. We prove convergence of the algorithm under suitable step-size selection and stopping rules and illustrate our theoretical results with numerical experiments for the nonlinear autoconvolution problem.

1. Introduction

Tikhonov regularization has become a well-established and widely used method for the stable solution of ill-posed problems. Its regularizing properties as well as convergence and rates of convergence of the approximate solutions have been studied in great detail. In particular, the theory for the classical formulation, where the penalty term is the square of the norm in a Hilbert space, is well developed (see, e.g. [1]). Over the last decade, there has been a growing interest in more general convex penalty terms in Banach spaces as well. Using a total variation type penalty [1, 9, 4, 31], for example, allows for the reconstruction of sharp edges in images, whereas $\ell_p$-norms of basis coefficients with $p < 2$ [10, 9, 7] are known to promote sparsity in the solution, a desirable effect in many applications. We refer the interested reader to the monographs [33, 34] and to the incomplete list [, 3, 4, 6, 14, 16, 17, 30] for further information in this direction.
Open questions remain, however, concerning the computational aspects of finding a minimizer of the Tikhonov functional. In particular, if the forward operator is nonlinear, many optimization routines will only converge locally and it may not be possible to obtain a global minimizer. Several attempts have been made to tackle this problem in the classical Hilbert space setting. In [5, 6] the TIGRA (TIkhonov-GRAdient) algorithm was introduced, which applies a gradient descent method to the Tikhonov functional at a decreasing sequence of regularization parameters $\alpha_j = q^j \alpha_0$. This algorithm was further studied in [0, 1, ]. Another approach to obtaining a global minimizer is given by multi-level methods [18, 19, 11], which minimize at different discretization levels using a gradient descent method for the least-squares functional.

2010 Mathematics Subject Classification. 65J20, 47J06, 47A52, 49J40.
Key words and phrases. Inverse problems, Tikhonov regularization, global optimization, gradient descent method, TIGRA, parameter choice rules.

In this paper, we will present a globally convergent minimization routine to obtain sparse reconstructions. To promote sparsity in the solution with respect to a given (Schauder) basis, we represent Banach space elements in terms of their coefficients and consider a nonlinear operator equation

$$F(x) = y, \qquad y \in \mathrm{Rg}(F),$$

between $X = \ell_p$ and a Hilbert space $Y$. Suppose that we are given only noisy data $y^\delta$ satisfying $\|y - y^\delta\|_Y \le \delta$. As regularized solutions $x_\alpha^\delta$ we then choose the minimizers of the generalized Tikhonov functional with $\ell_p$-penalty,

$$\Phi_\alpha(x) := \frac{1}{2}\|F(x) - y^\delta\|^2 + \frac{\alpha}{p}\|x\|_{\ell_p}^p, \qquad 1 < p \le 2, \tag{1.1}$$

where $\|x\|_{\ell_p}^p = \sum_{i \in \mathbb{N}} |x_i|^p$.

The algorithm we propose to obtain a minimizer is a generalization of the TIGRA algorithm to Banach spaces, which uses a gradient descent iteration in the dual space and will thus be called dual TIGRA or simply d-TIGRA. To the best of our knowledge, this is the first time a globally convergent algorithm is presented for the Tikhonov functional (1.1) with nonlinear forward operator. Gradient descent methods in the dual space with linear $F$ have already been considered in [5], for example. The global convergence of d-TIGRA results from a directional convexity of the Tikhonov functional $\Phi_\alpha(x)$ in a ball around a global minimizer $x_\alpha^\delta$, and the size of this region of convexity grows unboundedly as $\alpha \to \infty$.

Conceptually, the d-TIGRA algorithm goes as follows. Fix the starting value $x_0 \in X$ and find $\alpha_0$ large enough such that $x_0$ belongs to the region of convexity of $\Phi_{\alpha_0}(x)$. The dual gradient descent iterates converge to a minimizer of $\Phi_{\alpha_0}(x)$ provided that the step-sizes and stopping rule are suitably chosen. Once the iteration stops, reduce the value of $\alpha$ by a factor $q < 1$ such that the last iterate again belongs to the region of convexity for $\alpha_{j+1} = q\alpha_j$.
In this fashion, we continue with the dual gradient descent method as an inner iteration and reduce $\alpha_j$ by a factor $q$ in the outer iteration until the discrepancy principle $\|F(x_{j,k^*}) - y^\delta\| \le \tau\delta$ is satisfied, where $x_{j,k^*}$ denotes the last iterate with regularization parameter $\alpha_j$. This algorithm generates a sequence $\{x_{j,k}\}_{j \le j^*,\, k \le k^*(j)}$ that converges towards a global minimizer of $\Phi_{\alpha_{j^*}}(x)$ and thus also provides an automatic choice of the regularization parameter, which turns out to give near optimal estimates with respect to the Bregman distance and performs favorably in numerical experiments.

The paper is organized as follows: In Section 2 we introduce basic concepts and notations, formulate our standing assumptions and give general regularization results. Then the main theoretical results are presented in Section 3, where we obtain the region of convexity of the Tikhonov functional, we ensure that any starting value $x_0$ belongs to this region of convexity for sufficiently large $\alpha$, we give a convergence analysis for the dual gradient descent iteration, and we show that the final iterate $x_{j,k^*}$ at level $j$ can be used as a starting point for the dual gradient descent method at level $j+1$ with $\alpha_{j+1} = q\alpha_j$. Based on these findings, we summarize our assumptions on the parameters and prove the global convergence result for the dual TIGRA method in Section 4, which we illustrate in Section 5 with a numerical example. Several technical proofs are collected in Section 6 and, finally, for the convenience of the reader, we have included an overview of several constants that appear in the analysis at the end.

2. Preliminaries

Let $(Y, \|\cdot\|_Y)$ be a real Hilbert space and $F : \ell_p(\mathbb{N}) \to Y$ be a nonlinear operator, $1 < p \le 2$. We assume that $F$ is sequentially closed in the weak topologies of $\ell_p$ and $Y$. Whenever there is no risk of confusion, we will denote the norms in $\ell_p$, $\ell_q = (\ell_p)^*$ (where $1/p + 1/q = 1$) and $Y$ simply by $\|\cdot\|$. Similarly, the duality product $\langle\cdot,\cdot\rangle_{\ell_q \times \ell_p}$ as well as the scalar product in $Y$ are denoted by $\langle\cdot,\cdot\rangle$.

The Bregman distance is a powerful tool for regularization theory in Banach spaces and it will also play a key role in our analysis.

Definition 2.1. Let $X$ be a Banach space, $f : X \to \mathbb{R}$ be a convex functional and denote by $\partial f(x)$ the subdifferential of $f$ at $x \in X$. The Bregman distance $D_f^\xi(z,x)$ of two elements $x, z \in X$ with respect to $\xi \in \partial f(x)$ is defined by

$$D_f^\xi(z,x) := f(z) - f(x) - \langle \xi, z - x\rangle_{X^* \times X}.$$

Clearly, $D_f^\xi(x,x) = 0$ and the convexity of $f$ further implies that $D_f^\xi(z,x) \ge 0$ for all $\xi \in \partial f(x)$. For strictly convex functionals, we have that $D_f^\xi(x,z) = 0$ if and only if $x = z$. If the functional $f$ is Fréchet differentiable at $x \in X$, then $\partial f(x) = \{\nabla f(x)\}$ and we will write $D_f(z,x)$, omitting the dependence of the Bregman distance on $\xi = \nabla f(x)$.
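All of the above is, indeed, the case for the $\ell_p$ functionals

$$f_p(x) := \frac{1}{p}\|x\|_{\ell_p}^p = \frac{1}{p}\sum_{i \in \mathbb{N}} |x_i|^p, \qquad 1 < p \le 2,$$

which are of main interest to us here. The Bregman distance then reads

$$D_{f_p}(z,x) := \frac{1}{p}\|z\|^p - \frac{1}{p}\|x\|^p - \langle \nabla f_p(x), z - x\rangle, \qquad x, z \in \ell_p,$$

where $\nabla f_p(x) = \{|x_i|^{p-1}\,\mathrm{sgn}(x_i)\}_{i \in \mathbb{N}} \in \ell_q$, and the latter is closely linked to the duality mapping in $\ell_p$ with gauge function $t \mapsto t^{p-1}$.

Definition 2.2. The set-valued duality mapping $J_p$ with $p > 1$ of a Banach space $X$ into $X^*$ is defined by

$$J_p(x) := \{x^* \in X^* : \langle x^*, x\rangle_{X^* \times X} = \|x^*\|_{X^*}\|x\|_X,\ \|x^*\|_{X^*} = \|x\|_X^{p-1}\}.$$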
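As an illustration, the functionals $f_p$, their gradients and the resulting Bregman distance can be evaluated for finite coefficient vectors. The following is a minimal NumPy sketch, assuming that $\ell_p$ is truncated to $\mathbb{R}^n$; the helper names `f_p`, `J` and `bregman` as well as the test vectors are hypothetical and not taken from the paper.

```python
import numpy as np

def f_p(x, p):
    """f_p(x) = (1/p) * sum_i |x_i|^p."""
    return np.sum(np.abs(x) ** p) / p

def J(x, p):
    """Componentwise gradient of f_p: |x_i|^(p-1) * sgn(x_i)."""
    return np.sign(x) * np.abs(x) ** (p - 1)

def bregman(z, x, p):
    """D_{f_p}(z, x) = f_p(z) - f_p(x) - <grad f_p(x), z - x>."""
    return f_p(z, p) - f_p(x, p) - np.dot(J(x, p), z - x)

p = 1.5
q = p / (p - 1)                  # conjugate exponent, 1/p + 1/q = 1
rng = np.random.default_rng(0)   # arbitrary test vectors (assumption)
x = rng.standard_normal(6)
z = rng.standard_normal(6)

d_zx = bregman(z, x, p)          # nonnegative by convexity of f_p
d_xx = bregman(x, x, p)          # vanishes on the diagonal
x_back = J(J(x, p), q)           # mapping for q undoes the mapping for p
```

In this finite-dimensional setting, `x_back` reproduces `x` up to rounding because $(p-1)(q-1) = 1$ for conjugate exponents.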

From here on, we consider $X = \ell_p(\mathbb{N})$ with fixed $1 < p \le 2$, where the duality mappings are single valued,

$$J_p(x) = \nabla\Big(\frac{1}{p}\|\cdot\|_{\ell_p}^p\Big)(x) = \nabla f_p(x),$$

and we slightly abuse notation by identifying the set $J_p(x)$ with its unique element. Hence,

$$J_q(J_p(x)) = x \quad\text{and}\quad \|J_p(x)\|_{X^*} = \|x\|_X^{p-1} \tag{2.1}$$

hold, and the Bregman distance $D_{f_p}(z,x)$ can be equivalently expressed as

$$D_{f_p}(z,x) = \frac{1}{q}\|x\|^p + \frac{1}{p}\|z\|^p - \langle J_p(x), z\rangle = \frac{1}{q}\|J_p(x)\|^q - \frac{1}{q}\|J_p(z)\|^q - \langle J_p(x) - J_p(z), z\rangle = D_{f_q}(J_p(x), J_p(z)) \tag{2.2}$$

for all $x, z \in \ell_p$. In addition, the following important inequalities hold in $\ell_p$, which is 2-convex and $p$-smooth. They are based on results in [35, 8] and were also used in [3, 8], for example.

Lemma 2.3. Let $1 < p \le 2$. Then there exists a constant $\tilde c_p > 0$ such that

$$D_{f_p}(z,x) \le \tilde c_p \|x - z\|^p \tag{2.3}$$

holds for all $x, z \in \ell_p$. For $c_1, c_2 > 0$ and $x, z \in \ell_p$ satisfying $\|x\| \le c_1$ and $\|x - z\| \le c_2$ it holds that

$$D_{f_p}(z,x) \ge c_p \|x - z\|^2, \tag{2.4}$$

with $c_p = \frac{p-1}{2}(c_1 + c_2)^{p-2}$.

Proof. We first note that (2.3) holds true in any $p$-smooth Banach space. The proof of (2.4) is based on the estimate from [8, Lemma 1.4.6]

(2.5) 1 p ( x z + z )p 1 p z p z x x p 1 J p (x),z x,

as well as on the mean value theorem for $\varphi(t) = (t + \|x\|)^p$ in the form

$$\varphi(t) = \|x\|^p + p\|x\|^{p-1}t + \int_0^t p(p-1)(\tau + \|x\|)^{p-2}(t - \tau)\,d\tau.$$

Indeed, for $\|x\| \le c_1$ and $t = \|x - z\| \le c_2$, we obtain

$$D_{f_p}(z,x) \ge \frac{1}{p}\varphi(\|x - z\|) - \frac{1}{p}\|x\|^p - \|x\|^{p-1}\|x - z\| \ge \frac{p-1}{2}(c_1 + c_2)^{p-2}\|x - z\|^2.$$

The assertion follows with $c_p = \frac{p-1}{2}(c_1 + c_2)^{p-2}$.

On the other hand, in $X^* = \ell_q$ with $q \ge 2$, which is $q$-convex and 2-smooth, the following inequalities hold. The proof is again based on [8, Lemma 1.4.6] and goes as the proof of Lemma 2.3 with the inequality signs reversed.

Lemma 2.4. For $q \ge 2$, there exists a constant $\tilde c_q > 0$ such that

$$D_{f_q}(z,x) \ge \tilde c_q \|x - z\|^q \tag{2.6}$$

holds for all $x, z \in \ell_q$. For $c_1, c_2 > 0$ and $x, z \in \ell_q$ satisfying $\|x\| \le c_1$ and $\|x - z\| \le c_2$ it holds that

$$D_{f_q}(z,x) \le c_q \|x - z\|^2, \tag{2.7}$$

with $c_q = \frac{q-1}{2}(c_1 + c_2)^{q-2}$.

Throughout this paper, we denote by $x_\alpha^\delta$ a minimizer of the Tikhonov functional $\Phi_\alpha(x)$ in (1.1) with noisy data $y^\delta$ satisfying $\|y - y^\delta\| \le \delta$ and regularization parameter $\alpha > 0$. Here, the given data $y^\delta$ as well as the noise level $\delta$ are assumed to be fixed. Existence of minimizers as well as stability results for the regularized problems are well known and we refer to [33, 34] for details. By $x^\dagger$ we denote an $f_p$-minimizing solution of the original problem $F(x) = y$, in the sense that

$$F(x^\dagger) = y \quad\text{and}\quad f_p(x^\dagger) = \min_{x \in D(F)}\{f_p(x) : F(x) = y\}.$$

Let us now summarize our standing assumptions on the forward operator $F$ and a source condition on the $f_p$-minimizing solution $x^\dagger$.

Assumption 2.5. The operator $F$ is Fréchet differentiable on $\ell_p$ and the following holds:
(i) $F'$ is Lipschitz continuous, i.e. there exists $L > 0$ such that
$$\|F'(x) - F'(z)\|_{L(\ell_p, Y)} \le L\|x - z\|$$
holds for all $x, z \in \ell_p$.
(ii) There exists $c > 0$ such that
$$\|F(x) - F(z) - F'(z)(x - z)\| \le c\,D_{f_p}(x,z)$$
holds for all $x, z \in \ell_p$.
(iii) There exist $s > 0$, $\varrho < \frac{1}{sc}$, an $f_p$-minimizing solution $x^\dagger \in X$ and $\omega \in Y$ such that
$$\|\omega\| \le \varrho \quad\text{and}\quad J_p(x^\dagger) = F'(x^\dagger)^*\omega.$$

Remark 2.6. It is well known that Lipschitz continuity of the Fréchet derivative yields

$$\|F(x) - F(z) - F'(z)(x - z)\| \le \frac{L}{2}\|x - z\|^2 \tag{2.8}$$

and thus, on bounded domains, (i) and (ii) in Assumption 2.5 are also related via Lemma 2.3.

The nonlinearity condition (ii) in combination with the source condition (iii) was first introduced by Resmerita and Scherzer in [30] to obtain convergence rates for Tikhonov regularization with sparsity constraints; we also refer to [33, 15]. Here we derive slightly different estimates, which are better suited for our purposes, but result in the same rates of convergence. Similar results for an a-posteriori parameter choice rule, namely Morozov's discrepancy principle, have been established in [3, 4, 17].

Theorem 2.7. Let $y^\delta \in Y$ with $\|y^\delta - y\| \le \delta$ and let $x_\alpha^\delta$ be a minimizer of $\Phi_\alpha(x)$ given in (1.1) with $1 < p \le 2$. If Assumption 2.5 (ii), (iii) hold true for an $f_p$-minimizing solution $x^\dagger \in X$, then we have

$$\|F(x_\alpha^\delta) - y^\delta\| \le 2\alpha\|\omega\| + \delta, \qquad D_{f_p}(x_\alpha^\delta, x^\dagger) \le \frac{(\delta + \alpha\|\omega\|)^2}{2(1 - c\|\omega\|)\alpha}. \tag{2.9}$$

Proof. Since $x_\alpha^\delta$ is a minimizer of $\Phi_\alpha(x)$, it follows that

$$\frac{1}{2}\|F(x_\alpha^\delta) - y^\delta\|^2 + \alpha f_p(x_\alpha^\delta) - \alpha f_p(x^\dagger) \le \frac{1}{2}\|F(x^\dagger) - y^\delta\|^2.$$

The source condition in Assumption 2.5 (iii) thus yields

$$\begin{aligned}
\frac{1}{2}\|F(x_\alpha^\delta) - y^\delta\|^2 + \alpha D_{f_p}(x_\alpha^\delta, x^\dagger)
&\le \frac{1}{2}\|F(x^\dagger) - y^\delta\|^2 - \alpha\langle J_p(x^\dagger), x_\alpha^\delta - x^\dagger\rangle \\
&\le \frac{1}{2}\delta^2 - \alpha\langle \omega, F'(x^\dagger)(x_\alpha^\delta - x^\dagger)\rangle \\
&= \frac{1}{2}\delta^2 + \alpha\langle \omega,\ y - y^\delta + y^\delta - F(x_\alpha^\delta) + F(x_\alpha^\delta) - y - F'(x^\dagger)(x_\alpha^\delta - x^\dagger)\rangle \\
&\le \frac{1}{2}\delta^2 + \alpha\delta\|\omega\| + \alpha\|\omega\|\,\|F(x_\alpha^\delta) - y^\delta\| + \alpha\|\omega\|\,c\,D_{f_p}(x_\alpha^\delta, x^\dagger).
\end{aligned}$$

Therefore

$$\|F(x_\alpha^\delta) - y^\delta\|^2 - 2\alpha\|\omega\|\,\|F(x_\alpha^\delta) - y^\delta\| + 2(1 - c\|\omega\|)\alpha D_{f_p}(x_\alpha^\delta, x^\dagger) \le \delta^2 + 2\alpha\delta\|\omega\|$$

holds and finally

$$\big(\|F(x_\alpha^\delta) - y^\delta\| - \alpha\|\omega\|\big)^2 + 2(1 - c\|\omega\|)\alpha D_{f_p}(x_\alpha^\delta, x^\dagger) \le (\delta + \alpha\|\omega\|)^2$$

implies both error estimates in (2.9).

One easily verifies that the right-hand side in the estimate for $D_{f_p}(x_\alpha^\delta, x^\dagger)$ in (2.9) is minimized if the regularization parameter is chosen as $\alpha = \delta/\|\omega\|$. This a-priori parameter choice rule would thus give optimal convergence rate estimates in terms of the Bregman distance. Since the source element $\omega$ itself will usually not be at hand, we might use the estimate $\|\omega\| \le \varrho$ from Assumption 2.5 (iii) and choose $\alpha = \delta/\varrho$ instead. This is indeed feasible if the scaling parameter is $s = 3$, and the d-TIGRA algorithm will approximate $\alpha = \delta/\varrho$ from above (cf. Proposition 4.2). In what follows, we will consider values $\alpha \ge \underline{\alpha}$, where

$$\underline{\alpha} := \frac{\delta}{(s-2)\varrho}, \tag{2.10}$$

and for such $\alpha$ (2.9) yields

$$\|F(x_\alpha^\delta) - y^\delta\| \le 2\alpha\|\omega\| + \delta \le s\alpha\varrho, \tag{2.11}$$

which will turn out to be of great use. Thus, the scaling parameter $s > 2$ in Assumption 2.5 (iii) can be regarded as a trade-off between the smallness condition $\varrho < \frac{1}{sc}$ and the choice of the regularization parameter $\alpha$.

We conclude this section with two important consequences of Assumption 2.5. It is worth noting that the bounds obtained below are not necessarily optimal. Similar yet more elaborate estimates were already used in [6]. For our purposes, however, it is the existence of uniform bounds in $\alpha$ which is primarily important.

Lemma 2.8. If Assumption 2.5 holds, then

$$\|x_\alpha^\delta\| \le A := \Big(\frac{p}{2\underline{\alpha}}\Big)^{1/p}\|F(0) - y^\delta\|^{2/p}, \tag{2.12}$$

$$\|F'(x_\alpha^\delta)\| \le K := LA + \|F'(0)\|, \tag{2.13}$$

for all $\alpha \ge \underline{\alpha}$.

Proof. From the minimizing property of $x_\alpha^\delta$ we obtain

$$\alpha f_p(x_\alpha^\delta) \le \Phi_\alpha(x_\alpha^\delta) \le \Phi_\alpha(0) = \frac{1}{2}\|F(0) - y^\delta\|^2.$$

Thus, for $\alpha \ge \underline{\alpha}$,

$$\|x_\alpha^\delta\|^p \le \frac{p}{2\alpha}\|F(0) - y^\delta\|^2 \le \frac{p}{2\underline{\alpha}}\|F(0) - y^\delta\|^2.$$

Now, (2.12) readily follows by taking the $p$-th root. For the estimate on $\|F'(x_\alpha^\delta)\|$ we observe that by Assumption 2.5 (i)

$$\|F'(x_\alpha^\delta)\| \le \|F'(x_\alpha^\delta) - F'(0)\| + \|F'(0)\| \le L\|x_\alpha^\delta\| + \|F'(0)\|.$$

Thus (2.13) follows using (2.12).

Under our standing assumptions, the distance of minimizers $x_\alpha^\delta$ and $x_{\bar\alpha}^\delta$ corresponding to nearby values $\alpha$ and $\bar\alpha$ turns out to be of the order of $\alpha - \bar\alpha$. This result has essential implications regarding the outer update step where we decrease the regularization parameter $\alpha$ by a factor $q < 1$ (cf. Lemma 3.10 and Proposition 3.11). The rather technical proof is postponed to Section 6.1.

Proposition 2.9. Let Assumption 2.5 hold and let $\bar\alpha \ge \underline{\alpha}$. Then both $x_\alpha^\delta$ and $F(x_\alpha^\delta)$ are continuous from the right with respect to $\bar\alpha$, in the sense that

$$\lim_{\alpha \to \bar\alpha^+}\|x_\alpha^\delta - x_{\bar\alpha}^\delta\| = 0, \qquad \lim_{\alpha \to \bar\alpha^+}\|F(x_\alpha^\delta) - F(x_{\bar\alpha}^\delta)\| = 0.$$

In particular, for $\bar\alpha \le \alpha \le \bar\alpha/q_0$ with

$$q_0 := \frac{2sc\varrho}{1 + sc\varrho} < 1, \tag{2.14}$$

we have

$$\|x_\alpha^\delta - x_{\bar\alpha}^\delta\| \le \sigma(\alpha - \bar\alpha), \tag{2.15}$$

where $\sigma = \frac{s\varrho K}{c_A(1 - sc\varrho)\underline{\alpha}}$ and $c_A := \frac{p-1}{2}(3A)^{p-2}$ with $A$ from Lemma 2.8.
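The a-priori parameter choice discussed after Theorem 2.7 can be checked numerically. The sketch below evaluates the bound $(\delta + \alpha\|\omega\|)^2 / (2(1 - c\|\omega\|)\alpha)$ on the Bregman distance from (2.9) on a grid of $\alpha$-values and compares the grid minimizer with $\alpha = \delta/\|\omega\|$. The values of `delta`, `omega_norm` and `c` are arbitrary illustrative assumptions, chosen such that $c\|\omega\| < 1$.

```python
import numpy as np

def bregman_bound(alpha, delta, omega_norm, c):
    """Bound (delta + alpha*|w|)^2 / (2 * (1 - c*|w|) * alpha) on the Bregman distance."""
    return (delta + alpha * omega_norm) ** 2 / (2.0 * (1.0 - c * omega_norm) * alpha)

# Illustrative values only (assumptions, not taken from the paper).
delta, omega_norm, c = 1e-2, 0.5, 1.0

alphas = np.logspace(-4, 0, 2001)               # grid of regularization parameters
values = bregman_bound(alphas, delta, omega_norm, c)

alpha_grid = alphas[np.argmin(values)]          # minimizer on the grid
alpha_opt = delta / omega_norm                  # a-priori choice alpha = delta / |w|
bound_opt = bregman_bound(alpha_opt, delta, omega_norm, c)
```

At $\alpha = \delta/\|\omega\|$ the bound evaluates to $2\delta\|\omega\|/(1 - c\|\omega\|)$, i.e. it is of order $\delta$, which is the rate referred to in the text.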

Remark 2.10. Note that Proposition 2.9 implies uniqueness of the minimizer $x_\alpha^\delta$ for $\alpha \ge \underline{\alpha}$. Indeed, if we let $\bar\alpha = \alpha$ with arbitrarily chosen minimizers $x_\alpha^\delta$, $x_{\bar\alpha}^\delta$, then (2.15) yields $\|x_{\bar\alpha}^\delta - x_\alpha^\delta\| = 0$.

3. The iterated dual gradient descent method

In this section we analyze a dual gradient descent method for the computation of a minimizer of the Tikhonov functional with sparsity constraints using the discrepancy principle as a global stopping rule. Several technical proofs are collected in Section 6 for improved readability. The algorithm can be summarized as follows:

Algorithm 1: The dual TIGRA algorithm
    input: $q < 1$, $\alpha_0$ and $x_0$.
    init:  $j = 0$ and $x_{0,0} = x_0$.
     1  while $\|F(x_{j,k^*}) - y^\delta\| > \tau\delta$ do
     2      if $j \ge 1$ then
     3          $\alpha_{j+1} = q\alpha_j$
     4          $x_{j+1,0} = x_{j,k^*}$
     5      end
     6      $j = j + 1$, $k = 0$
     7      while $\|\nabla\Phi_{\alpha_j}(x_{j,k})\| > C_j$ do
     8          $J_p(x_{j,k+1}) = J_p(x_{j,k}) - \beta_{j,k}\nabla\Phi_{\alpha_j}(x_{j,k})$
     9          $x_{j,k+1} = J_q(J_p(x_{j,k+1}))$
    10          $k = k + 1$
    11      end
    12  end

There are several requirements we need to verify in order to prove that the above algorithm is well-defined and globally convergent. First of all, for each $j$ the dual gradient descent algorithm in lines 7-11 starting from $x_{j,0}$ should converge to a minimizer $x_{\alpha_j}^\delta$ of $\Phi_{\alpha_j}(x)$ and, secondly, $\|\nabla\Phi_{\alpha_j}(x_{j,k})\|$ should tend to zero as $k \to \infty$, so that each inner iteration terminates after a finite number of steps. Additionally, we need to ensure that the discrepancy principle can be used as a stopping rule for the outer iteration.

3.1. A convexity property of the Tikhonov functional.

In the following, we would like to investigate the local directional convexity of the Tikhonov functional $\Phi_\alpha(x)$ near a minimizer $x_\alpha^\delta$, i.e. the convexity of the functions

$$\phi_{\alpha,h}(t) := \Phi_\alpha(x_\alpha^\delta + th), \qquad t \in \mathbb{R}, \tag{3.1}$$

for $h \in X$ with $\|h\| = 1$. To this end, recall the following characterization of convexity on intervals.

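Proposition 3.1. A continuously differentiable function $\phi : \mathbb{R} \to \mathbb{R}$ is convex on an interval $I$ if and only if

$$\phi(t_2) - \phi(t_1) - \phi'(t_1)(t_2 - t_1) \ge 0$$

for all $t_1, t_2 \in I$, and $\phi$ is strictly convex if equality only holds for $t_1 = t_2$.

Thus, $\phi_{\alpha,h}(t)$ from (3.1) is convex on $[-r, r]$ if

$$0 \le \phi_{\alpha,h}(t_2) - \phi_{\alpha,h}(t_1) - (t_2 - t_1)\phi_{\alpha,h}'(t_1) = R_{\Phi_\alpha}(x_\alpha^\delta + t_2 h, x_\alpha^\delta + t_1 h)$$

holds for all $t_1, t_2 \in [-r, r]$, where

$$R_{\Phi_\alpha}(z,x) := \Phi_\alpha(z) - \Phi_\alpha(x) - \langle \nabla\Phi_\alpha(x), z - x\rangle \tag{3.2}$$

for $x, z \in X$. The following theorem asserts the existence of such $r$ as a function of the regularization parameter $\alpha$. The proof can be found in Section 6.2.

Theorem 3.2. Let Assumption 2.5 hold and let $\gamma = 1 - sc\varrho$. Then, for all $\alpha \ge \underline{\alpha}$ and $h \in X$ with $\|h\| = 1$,

$$R_{\Phi_\alpha}(x_\alpha^\delta + t_2 h, x_\alpha^\delta + t_1 h) \ge \gamma\alpha D_{f_p}(x_\alpha^\delta + t_2 h, x_\alpha^\delta + t_1 h)$$

holds for $|t_1|, |t_2| \le r_\alpha$, where $R_{\Phi_\alpha}$ is as defined in (3.2) and

{ } γα γα (3.3) r α := 1+ min csk,. 10cL

In particular, $\phi_{\alpha,h}(t)$ in (3.1) is thus convex on $[-r_\alpha, r_\alpha]$.

Of course, Theorem 3.2 does not ensure that $\Phi_\alpha(x)$ is itself convex in any neighbourhood of $x_\alpha^\delta$. We will say that $\Phi_\alpha(x)$ is directionally convex on $B_{r_\alpha}(x_\alpha^\delta)$, tacitly assuming that only lines through $x_\alpha^\delta$ are considered. Here $B_r(x)$ denotes the ball of radius $r$ centered at $x \in X$, i.e.

$$B_{r_\alpha}(x_\alpha^\delta) = \{x \in X : \|x - x_\alpha^\delta\| \le r_\alpha\}.$$

Note that $r_\alpha$ does not depend on the direction $h$ and that $r_\alpha \to \infty$ as $\alpha \to \infty$.

As a first consequence of Theorem 3.2, we obtain an estimate on the directional derivative of $\Phi_\alpha(x)$ in direction $x_\alpha^\delta$.

Proposition 3.3. Let Assumption 2.5 hold and let $x \in B_{r_\alpha}(x_\alpha^\delta)$ with $r_\alpha$ from Theorem 3.2 for some $\alpha \ge \underline{\alpha}$. Then

$$\langle \nabla\Phi_\alpha(x), x - x_\alpha^\delta\rangle \ge \gamma\alpha D_{f_p}(x_\alpha^\delta, x) \tag{3.4}$$

holds with $\gamma = 1 - sc\varrho$.

Proof. Using the minimizing property of $x_\alpha^\delta$ and Theorem 3.2, we obtain

$$\langle \nabla\Phi_\alpha(x), x - x_\alpha^\delta\rangle = R_{\Phi_\alpha}(x_\alpha^\delta, x) - \Phi_\alpha(x_\alpha^\delta) + \Phi_\alpha(x) \ge R_{\Phi_\alpha}(x_\alpha^\delta, x) \ge \gamma\alpha D_{f_p}(x_\alpha^\delta, x),$$

and the proof is complete.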
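For intuition, the remainder $R_{\Phi_\alpha}$ from (3.2) can be evaluated explicitly when the forward operator is linear. The sketch below assumes $F(x) = Ax$ with a random matrix `A` (an illustrative simplification; the paper treats nonlinear $F$). In that case a direct computation gives $R_{\Phi_\alpha}(z,x) = \frac12\|A(z-x)\|^2 + \alpha D_{f_p}(z,x)$, so $R_{\Phi_\alpha}(z,x) \ge \alpha D_{f_p}(z,x)$ holds along every line. All names and numerical values are hypothetical.

```python
import numpy as np

p, alpha = 1.5, 0.1
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))      # hypothetical linear forward operator
y_delta = rng.standard_normal(4)     # hypothetical noisy data

def f_p(x):
    return np.sum(np.abs(x) ** p) / p

def grad_f_p(x):
    return np.sign(x) * np.abs(x) ** (p - 1)

def phi(x):
    """Tikhonov functional 0.5*|Ax - y_delta|^2 + alpha*f_p(x)."""
    return 0.5 * np.sum((A @ x - y_delta) ** 2) + alpha * f_p(x)

def grad_phi(x):
    return A.T @ (A @ x - y_delta) + alpha * grad_f_p(x)

def remainder(z, x):
    """R(z, x) = Phi(z) - Phi(x) - <grad Phi(x), z - x>."""
    return phi(z) - phi(x) - np.dot(grad_phi(x), z - x)

def bregman(z, x):
    return f_p(z) - f_p(x) - np.dot(grad_f_p(x), z - x)

x = rng.standard_normal(6)                   # base point of the line
h = rng.standard_normal(6)
h /= np.sum(np.abs(h) ** p) ** (1.0 / p)     # unit direction in the l_p norm
ts = np.linspace(-1.0, 1.0, 21)
pairs = [(x + t2 * h, x + t1 * h) for t1 in ts for t2 in ts]

lhs = np.array([remainder(z, w) for z, w in pairs])
rhs = np.array([0.5 * np.sum((A @ (z - w)) ** 2) + alpha * bregman(z, w)
                for z, w in pairs])
```

For nonlinear $F$ the quadratic term is replaced by terms that can be negative, which is why the estimate can only be expected on a ball of limited radius around the minimizer.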

Remark 3.4. Note that, due to the strict convexity of $f_p(x)$, Proposition 3.3 in particular implies that there are no critical points other than $x_\alpha^\delta$ inside $B_{r_\alpha}(x_\alpha^\delta)$. The size $r_\alpha$ of this region increases as $\alpha \to \infty$, and due to Proposition 2.9 there exists $\alpha$ such that $x_\alpha^\delta \ne 0$, but $0 \in B_{r_\alpha}(x_\alpha^\delta)$. Consequently, $F'(0)$ cannot be the zero mapping, as otherwise $\nabla\Phi_\alpha(0) = 0$ would hold and thus $x = 0$ would be a critical point of $\Phi_\alpha(x)$.

3.2. Initial values.

Our convergence analysis below relies primarily on the Bregman distance, and for the strictly convex functionals $f_p(x)$ it is well known that $D_{f_p}(x_{\alpha_j}^\delta, x_{j,k}) \to 0$ also yields $\|x_{\alpha_j}^\delta - x_{j,k}\| \to 0$. But at the same time it will be important to ensure that the iterates $x_{j,k}$ remain inside the region of directional convexity $B_{r_{\alpha_j}}(x_{\alpha_j}^\delta)$ with radius $r_{\alpha_j}$ from Theorem 3.2 throughout the iteration (compare the proof of Theorem 3.9, for example). The following general result asserts that indeed both can be achieved simultaneously. The proof is given in Section 6.3.

Lemma 3.5. To every $\alpha \ge \underline{\alpha}$ there exists $d_\alpha > 0$ such that

$$\mathcal{B}_{d_\alpha}(x_\alpha^\delta) \subset B_{r_\alpha}(x_\alpha^\delta),$$

where $\mathcal{B}_d(z)$ denotes the ball of radius $d$ around $z \in X$ with respect to the Bregman distance, i.e.

$$\mathcal{B}_{d_\alpha}(x_\alpha^\delta) := \{x \in X : D_{f_p}(x_\alpha^\delta, x) \le d_\alpha\}. \tag{3.5}$$

Moreover, the numbers $d_\alpha$ may be chosen such that they depend on $\alpha$ continuously, are strictly monotonically increasing and satisfy $d_\alpha \to \infty$ as $\alpha \to \infty$.

In view of Theorem 3.9 below, we refer to $\mathcal{B}_{d_\alpha}(x_\alpha^\delta)$ as the region of convergence of the dual gradient descent method. For the initiation of the algorithm it will be important that the starting point $x_0$ belongs to $\mathcal{B}_{d_{\alpha_0}}(x_{\alpha_0}^\delta)$ for the regularization parameter $\alpha_0$. It is our next objective to show that the latter is indeed the case for sufficiently large $\alpha_0$.

Proposition 3.6. Let Assumption 2.5 hold. Then to each starting value $x_0 \in X$ there exists $\alpha_0 > \underline{\alpha}$ large enough such that $x_0 \in \mathcal{B}_{d_{\alpha_0}}(x_{\alpha_0}^\delta)$ as defined in (3.5).

Proof. Observe that by Lemma 2.3 and Lemma 2.8,

$$D_{f_p}(x_\alpha^\delta, x_0) \le \tilde c_p\|x_\alpha^\delta - x_0\|^p \le \tilde c_p(A + \|x_0\|)^p$$

remains bounded, whereas $d_\alpha \to \infty$ as $\alpha \to \infty$ according to Lemma 3.5. Thus, there exists $\alpha_0$ such that $D_{f_p}(x_{\alpha_0}^\delta, x_0) \le d_{\alpha_0}$ and the proof is complete.

3.3. Convergence of the dual gradient descent method.

Let us now analyze the dual gradient descent iteration for fixed regularization parameter $\alpha_j \ge \underline{\alpha}$, which reads as follows:

$$J_p(x_{j,k+1}) = J_p(x_{j,k}) - \beta_{j,k}\nabla\Phi_{\alpha_j}(x_{j,k}), \qquad x_{j,k+1} = J_q(J_p(x_{j,k+1})). \tag{3.6}$$
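One step of the dual iteration (3.6) is straightforward to write down for finite coefficient vectors. The following minimal sketch uses the componentwise form of the duality mappings in $\ell_p$ and $\ell_q$; the function names `duality_map` and `dual_step`, the fixed step-size `beta` and the vector `g` standing in for the gradient are illustrative assumptions, not the step-size rule of the paper. It also shows that for $p = 2$ both duality mappings are the identity, so that the step reduces to a classical gradient descent step.

```python
import numpy as np

def duality_map(x, r):
    """Componentwise map |x_i|^(r-1) * sgn(x_i), i.e. J_r on finite vectors."""
    return np.sign(x) * np.abs(x) ** (r - 1)

def dual_step(x, grad, beta, p):
    """Update J_p(x) in the dual space, then map back to the primal space with J_q."""
    q = p / (p - 1)                      # conjugate exponent
    xi = duality_map(x, p) - beta * grad
    return duality_map(xi, q)

x = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, 0.1, -0.4])           # stands in for the gradient of the Tikhonov functional
beta = 0.1                               # fixed step-size (assumption)

x_hilbert = dual_step(x, g, beta, p=2.0)   # coincides with x - beta * g
x_banach = dual_step(x, g, beta, p=1.5)    # a genuinely different update for p < 2
x_same = dual_step(x, g, 0.0, p=1.5)       # a zero step-size returns x
```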

In particular, we will show that the sequence of iterates $\{x_{j,k}\}_{k \in \mathbb{N}}$ approaches the minimizer $x_{\alpha_j}^\delta$ in the Bregman distance. Then, for the strictly convex functionals $f_p(x)$ with $1 < p \le 2$, this even implies convergence in norm.

For an appropriate starting value $x_{j,0}$ and suitably chosen step-sizes $\beta_{j,k}$, the Bregman distance between the iterates $x_{j,k}$ and the exact minimizer $x_{\alpha_j}^\delta$ does not increase throughout the algorithm. We prove this result in Section 6.4.

Theorem 3.7. Let Assumption 2.5 hold and suppose that $D_{f_p}(x_{\alpha_j}^\delta, x_{j,k}) \le d_\alpha$ with $d_\alpha$ as in Lemma 3.5 and that $\nabla\Phi_{\alpha_j}(x_{j,k}) \ne 0$. Then $x_{j,k+1}$ from (3.6) is at least as close to $x_{\alpha_j}^\delta$ as $x_{j,k}$ with respect to the Bregman distance, i.e.

$$D_{f_p}(x_{\alpha_j}^\delta, x_{j,k+1}) \le D_{f_p}(x_{\alpha_j}^\delta, x_{j,k}),$$

if the step-size $\beta_{j,k}$ is chosen such that

(3.7) β j,k γ c j,k α j Φ αj (x j,k )

holds, where

(3.8) c j,k := min 1, 8 c A (Φ αj (x j,k ) φ j,k ) 4K +4Ls α j +4LKr αj +L r αj +8α j c p c p A

with $\tilde c_A := \frac{p-1}{2}(r_\alpha + A)^{p-2}$ and

$$\varphi_{j,k} := \min\{\Phi_{\alpha_j}(J_q(J_p(x_{j,k}) + t\,\nabla\Phi_{\alpha_j}(x_{j,k}))) : t \in \mathbb{R}^+\}.$$

Using the decay with respect to the Bregman distance, we also obtain monotonicity in the Tikhonov functional values $\Phi_{\alpha_j}(x_{j,k})$ and that $\|\nabla\Phi_{\alpha_j}(x_{j,k})\| \to 0$ as $k \to \infty$. It is worth noting that the latter also ensures that the inner stopping rule $\|\nabla\Phi_{\alpha_j}(x_{j,k})\| \le C_{\alpha_j}$ is well-defined for any value $C_{\alpha_j} > 0$. For the proof, which goes along the lines of Theorem 2.9 in [5], we refer to Section 6.5, where the constants $M_{\alpha_j}$ are explicitly given as well.

Proposition 3.8. Let Assumption 2.5 hold and suppose that $D_{f_p}(x_{\alpha_j}^\delta, x_0) \le d_\alpha$ with $d_\alpha$ as in Lemma 3.5. Then there exists a constant $M_{\alpha_j} > 0$ such that the sequence $\{x_{j,k}\}_{k \in \mathbb{N}_0}$ generated by (3.6) with step-sizes

{ } γ cj,k α j 1 (3.9) β j,k := min Φ αj (x j,k ), M αj

satisfies

$$\Phi_{\alpha_j}(x_{j,k+1}) \le \Phi_{\alpha_j}(x_{j,k}) \quad\text{and}\quad R_{\Phi_{\alpha_j}}(x_{j,k+1}, x_{j,k}) \le M_{\alpha_j} D_{f_p}(x_{j,k+1}, x_{j,k})$$

for all $k \ge 0$, as well as $\|\nabla\Phi_{\alpha_j}(x_{j,k})\| \to 0$ as $k \to \infty$.

Now we can give the convergence result for the dual gradient descent method with fixed $\alpha_j \ge \underline{\alpha}$.

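Theorem 3.9. Suppose that Assumption 2.5 holds and that $x_{j,0} \in \mathcal{B}_{d_{\alpha_j}}(x_{\alpha_j}^\delta)$ for $\alpha_j \ge \underline{\alpha}$. Let $\{x_{j,k}\}_{k \in \mathbb{N}_0}$ be the sequence generated by the dual iteration (3.6) with step-sizes $\beta_{j,k}$ from (3.9). Then $x_{j,k}$ converges to the global minimizer $x_{\alpha_j}^\delta$ of $\Phi_{\alpha_j}$, i.e. $x_{j,k} \to x_{\alpha_j}^\delta$ as $k \to \infty$.

Proof. We show $D_{f_p}(x_{\alpha_j}^\delta, x_{j,k}) \to 0$; then $x_{j,k} \to x_{\alpha_j}^\delta$ follows due to the strict convexity of the functionals $f_p(x)$. From Proposition 3.3, we obtain

$$D_{f_p}(x_{\alpha_j}^\delta, x_{j,k}) \le \frac{1}{\gamma\alpha_j}\langle \nabla\Phi_{\alpha_j}(x_{j,k}), x_{j,k} - x_{\alpha_j}^\delta\rangle \le \frac{1}{\gamma\alpha_j}\|\nabla\Phi_{\alpha_j}(x_{j,k})\|\,\|x_{j,k} - x_{\alpha_j}^\delta\|.$$

Thus, the result follows as, on the one hand, $\|x_{j,k} - x_{\alpha_j}^\delta\| \le r_{\alpha_j}$ remains bounded for all $k$ throughout the iteration due to $x_{j,0} \in \mathcal{B}_{d_{\alpha_j}}(x_{\alpha_j}^\delta)$ and Theorem 3.7, and, on the other hand, $\|\nabla\Phi_{\alpha_j}(x_{j,k})\| \to 0$ as $k \to \infty$ according to Proposition 3.8.

3.4. Updating the regularization parameter.

According to Theorem 3.9, the dual gradient descent method with regularization parameter $\alpha_j$ converges to the minimizer $x_{\alpha_j}^\delta$ of the Tikhonov functional $\Phi_{\alpha_j}(x)$ if the starting value $x_{j,0}$ belongs to the region of convergence, $x_{j,0} \in \mathcal{B}_{d_{\alpha_j}}(x_{\alpha_j}^\delta)$. Based on Proposition 2.9, we will now show that this is the case for the choice $x_{j+1,0} = x_{j,k^*(j)}$ if the update $\alpha_{j+1} = q\alpha_j$ is not too large and if $C_j$ in the stopping rule $\|\nabla\Phi_{\alpha_j}(x_{j,k})\| \le C_j$ is chosen suitably. To this end, we first specify the requirements for the update factor $q$.

Lemma 3.10. Let Assumption 2.5 hold and let $\alpha_0$ be chosen as in Proposition 3.6. If $q \in (q_0, 1)$ with $q_0$ from (2.14) is chosen such that

(3.10) ρσα 0 (1 q) min { d α /3,1 } 1/p

holds, where $\sigma$ is as in Proposition 2.9 and

ρ := max { c p, A p 1 +r } p 1 α0

with $\tilde c_p$ and $d_\alpha$ as in Lemma 2.3 and Lemma 3.5, respectively, then for all $\alpha \in (\underline{\alpha}, \alpha_0]$

$$D_{f_p}(x_{\bar\alpha}^\delta, x_\alpha^\delta) \le \frac{d_{\bar\alpha}}{3} \quad\text{and}\quad \|x_{\bar\alpha}^\delta - x_\alpha^\delta\| \le \frac{d_{\bar\alpha}}{3\rho} \tag{3.11}$$

holds, where $\bar\alpha := \max\{q\alpha, \underline{\alpha}\}$.

Proof. Note that the existence of $q$ as in (3.10) is guaranteed as only the left-hand side in (3.10) tends to zero as $q \to 1$. Thus, by virtue of Proposition 2.9, we obtain

$$\|x_\alpha^\delta - x_{\bar\alpha}^\delta\| \le \sigma\alpha_0(1 - q) \le \frac{d_{\underline{\alpha}}}{3\rho} \le \frac{d_{\bar\alpha}}{3\rho},$$

and, due to Lemma 2.3, also

$$D_{f_p}(x_\alpha^\delta, x_{\bar\alpha}^\delta) \le \tilde c_p\|x_{\bar\alpha}^\delta - x_\alpha^\delta\|^p \le (\rho\sigma\alpha_0(1 - q))^p \le \min\{d_{\underline{\alpha}}/3, 1\}^p \le \frac{d_{\bar\alpha}}{3},$$

where we have used $q\alpha \le \bar\alpha \le \alpha \le \min\{\bar\alpha/q_0, \alpha_0\}$ as well as the monotonicity of $d_\alpha$.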

GLOBAL MINIMIZATION OF TIKHONOV FUNCTIONALS 13 Proposition 3.11. Let q be chosen as in Lemma 3.10. Suppose that x j,0 B dαj (x δ α j ) and let {x j,k } k 0 be the sequence generated by (3.6) with step-sizes β j,k from (3.9). If x j,k is defined to be the first iterate which satisfies (3.1) Φ αj (x j,k ) C αj, with (3.13) C αj γα jd αj+1 3r αj, then x j,k B dαj+1 (x δ α j+1 ), where α j+1 := max{ qα j,α } and q is as in Lemma 3.10. Proof. Note that D fp (x δ α j+1,x j,k ) = D fp (x δ α j+1,x δ α j )+D fp (x δ α j,x j,k )+ J p (x δ α j ) J p (x j,k ),x δ α j+1 x δ α j. Due to (.1) and Lemma.8, we obtain J p (x δ α j ) J p (x j,k ) x δ α j p 1 + x j,k p 1 ρ with ρ from Lemma 3.10. In addition, using Proposition 3.3 D fp (x δ α j,x j,k ) 1 γα j Φ αj (x j,k ),x j,k x δ α j 1 Φ αj (x j,k ) x j,k x δ α γα j C jr αj d α j+1 j γα j 3 holds as x j,k B dαj (x δ α j ) due to Theorem 3.7. Combining these estimates with (3.11) in Lemma 3.10, we obtain D fp (x δ α j+1,x j,k ) d α j+1 3 + d α j+1 3 +ρ d α j+1 3ρ = d α j+1. 4. The global minimization strategy Aswehave seen, forfixed α j α thedualgradientdescent iteration(3.6)yields a sequence of iterates {x j,k } k N0 which converges to x δ α j and will terminate with an approximation x j,k. But we have yet to ensure that the proposed algorithm terminates after a finite number j of outer iteration steps with a regularization parameter α j α. Before we give this main convergence result for the d-tigra method, we collect here all the assumptions on the parameters, which determine the behavior of the algorithm. For the values of the various constants appearing in the expressions below, we refer the reader to the list of constants which we have included at the end of this paper. Assumption 4.1. Suppose that (i) the initial parameter α 0 is sufficiently large such that x 0 B dα0 (x δ α 0 ), i.e. D fp (x δ α 0, x 0 ) d α0,

14 WEI WANG, STEPHAN W. ANZENGRUBER, RONNY RAMLAU, AND BO HAN (ii) the factor q ( q 0,1) with q 0 from (.14) satisfies ρσα 0 (1 q) min { d α /3,1 }, (iii) the step-sizes β j,k are chosen according to β j,k = min { γ cj,k α j Φ αj (x j,k ), 1 M αj }, (iv) the constants C αj such that { } dαj+1 δ c A C αj γα j min,, 3r αj K +r αj (c c A +L) (v) and τ in the discrepancy principle as τ = + q(s ). For the following Proposition, compare also Proposition 6.3 and 6.4 in [6]. Proposition 4.. Let Assumptions.5 and 4.1 hold, then the iterates x j,k generated by the dual TIGRA algorithm remain inside the region of convergence, i.e. x j,k B dαj (x δ α j ) defined in (3.5), and the algorithm terminates after a finite number j of outer iteration steps with regularization parameter α j α. Proof. We begin by showing that for α 0, q, β j,k, and C αj as in Assumption 4.1, we have x j,k B dαj (x δ α j ): This holds true for the starting value x 0,0 = x 0 by definition of α 0, and as then the Bregman distance decreases with step-sizes β j,k (cf. Theorem 3.7) also for x 0,k, k 0. Finally, for the outer iteration we have x j+1,0 = x j,k B dαj+1 (x δ α j+1 ) according to Proposition 3.11, where x j,k denotes the final iterate with index j, and using an induction argument the assertion follows for all j N 0 and 0 k k (j). According to Lemma 3.5 it thus holds that D fp (x δ α j,x j,k ) d αj and x δ α j x j,k r αj. To prove that the dual TIGRA algorithm stops with α j α, we proceed by contradiction. Note that the sequence α j = q j α 0 is monotonically decreasing and converges to zero since q < 1. Let now j denote the unique index such that α j α and α j+1 = qα j < α and suppose that j j + 1. Due to Lemma.3 with c p = c A = p 1 (A+r α j ) p, the inner stopping rule Φ αj (x j,k ) C αj and Proposition 3.3 we obtain x δ α j x j,k 1 c D fp (x δ α j,x j,k ) 1 Φ αj (x j,k ),x j,k x δ α A c A γα j j C α j c A γα j x δ α j x j,k. 
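Using a bootstrap argument on the one hand, and $\|x_{\alpha_j}^\delta - x_{j,k}\| \le r_{\alpha_j}$ on the other hand, it follows that
$$\|x_{\alpha_j}^\delta - x_{j,k}\| \le \frac{C_{\alpha_j}}{c_A \gamma \alpha_j} \quad\text{and}\quad D_{f_p}(x_{\alpha_j}^\delta, x_{j,k}) \le \frac{C_{\alpha_j}\, r_{\alpha_j}}{\gamma \alpha_j}.$$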

GLOBAL MINIMIZATION OF TIKHONOV FUNCTIONALS 15 Thus, Assumption.5, Lemma.8 and Assumption 4.1 (iv) yield F(x j,k ) F(x δ α j ) F(x j,k ) F(x δ α j ) F (x j,k )(x δ α j x j,k ) + F (x j,k ) x δ α j x j,k cd fp (x δ α j,x j,k )+(Lr αj +K) x δ α j x j,k C α j c A γα j ( rαj (c c A +L)+K ) δ. In combination with F(x δ α j ) y δ α j +δ from (.11) and α j+1 < α = δ (s ) we obtain F(x j,k ) y δ F(x j,k ) F(x δ α j ) + F(x δ α j ) y δ αj+1 +δ q δ +δ τδ. q(s ) Thus thediscrepancy principle withτ fromassumption 4.1(v) is satisfied forx j,k, which contradicts our assumption j j+1. Hence, the iteration terminates with a regularization parameter α j α. 5. Numerical examples In this section we compare numerical results for the d-tigra method to a dual version of the modified Landweber method [3]. Both methods are applied to the nonlinear ill-posed auto-convolution problem(compare also [3]), where the forward operator is given by (5.1) G(f)(s) = s 0 f(s t)f(t)dt, s [0,1],f L [0,1]. It has been shown in [13] that the auto-convolution operator is continuous as G(f 1 ) G(f ) C[0,1] ( f 1 L [0,1] + f L [0,1]) f 1 f L [0,1], with Fréchet derivative G (f) given by [ G (f)h ] s (s) = [f h](s) = f(s t)h(t)dt, s [0,1],h L [0,1] and weakly sequentially closed on the domain 0 D(G) = D + := {f L [0,1] : f(t) 0 a.e. t [0,1]}. As mentioned in [3], G (.) is Lipschitz continuous with constant L = and adjoint G (f) h = ( f h ), where h(t) = (h) (t) = h(1 t). For the reconstruction of solutions which are sparse with respect to a certain Riesz basis or frame {u i } i N of L [0,1], we represent functions f L [0,1] in

16 WEI WANG, STEPHAN W. ANZENGRUBER, RONNY RAMLAU, AND BO HAN terms of their coefficients x such that f = Tx, where the bounded, linear operator T : l L [0,1] is given by Tx := i N x i u i with adjoint T f = { f,u i } i N. In accordance with our theoretic results, we restrict the domain of T to sequences in l p (N) with 1 < p and consider the problem of reconstructing a solution x of F(x) = y, where F := G T : l p Y = L [0,1] from noisy data y δ with y y δ Y δ. Clearly, T : l p L [0,1] as well as T : L [0,1] l q remain bounded as Tx L [0,1] T l L [0,1] x l T l L [0,1] x l p holds for all x l p and an analog expression is obtained for T. As above, the Tikhonov-functional with sparsity constraints reads Φ α (x) := 1 F(x) yδ + α xi p p and we have Φ α (x) = T G (Tx) (G(Tx) y δ )+αj p (x), where J p (x) i = sign(x i ) x i p 1. In our numerical experiments, we compute approximations of an exact solution f which is sparse with respect to the Haar wavelet basis and is described by three nonzero coefficients, namely x ({,4,7}) = {3, 1,0.5}. Due to the quadratic nature of the autoconvolution problem, whenever f is a solution of G(f) = y, then so is f. The functions are sampled at 51 points equispaced in [0,1] and we use wavelet coefficients up to index level J = 9. 5.1. Results for the dual TIGRA method. The Fréchet derivative F (x) of the autoconvolution problem is Lipschitz continuous and thus Assumption.5 (i) is satisfied. However, nonlinearity conditions as required in Assumption.5 (ii) are not known, and recent results from [7] indicate that source conditions as in Assumption.5 (iii) certainly cannot hold for the autoconvolution operator. This is supported by the fact that F (0) w = 0 for all w Y, which contradicts our assumptions according to Remark 3.4. Thus, the theoretically justified choices from Assumption 4.1 are not directly applicable to the problem at hand. 
Nevertheless, the dual TIGRA method performed well in the numerical experiments, and the results were obtained with the following parameters: The initial regularization parameter $\alpha_0 = 10^6$ was updated in the outer iteration with a factor $q = 0.7$. Recall that $\alpha_0$ should be chosen large enough to ensure that the initial guess $x_0$ belongs to the region of convergence and

$q$ should be sufficiently close to 1 so that $x_{j,k}$ can be used as initial guess for $\alpha_{j+1}$. In the inner iterations we used the step-size selection
(5.2)  $\beta_{j,k} = \min\Bigl( \dfrac{1}{\|\nabla\Phi_{\alpha_j}(x_{j,k})\|},\ 0.0 \Bigr).$

Figure 1. Reconstructions obtained by the dual TIGRA method with $\|x_0\| = 1$ and 5% (top), 1% (middle), 0.5% (bottom) noise. Panels from left to right: $p = 1.2$, $p = 1.6$, $p = 2$.

                    p = 1.2                                  p = 1.6
δ      ‖x_0‖    α_j           j    k       e_k      α_j           j    k      e_k
5%     1        7.5·10^{-2}   47   90      0.58     1.8·10^{-2}   51   349    0.33
5%     500      1.8·10^{-2}   51   199     0.6      2.6·10^{-2}   50   14     0.4
5%     1000     1.5·10^{-3}   58   3906    0.43     5.2·10^{-2}   48   311    0.5
5%     10000    4.3·10^{-3}   55   85      0.45     1.1·10^{-1}   46   543    0.5
1%     1        8.8·10^{-3}   53   1697    0.       4.3·10^{-3}   55   1503   0.
1%     500      6.2·10^{-3}   54   135     0.1      4.3·10^{-3}   55   1618   0.3
1%     1000     8.8·10^{-3}   53   980     0.1      6.2·10^{-3}   54   1571   0.3
1%     10000    2.1·10^{-3}   57   841     0.       4.3·10^{-3}   55   114    0.4
0.5%   1        1.5·10^{-3}   58   568     0.13     1.5·10^{-3}   58   5006   0.15
0.5%   500      8.8·10^{-3}   53   68      0.097    1.5·10^{-3}   58   4577   0.18
0.5%   1000     7.3·10^{-4}   60   10694   0.14     1.5·10^{-3}   58   681    0.17
0.5%   10000    2.5·10^{-4}   63   11016   0.079    1.5·10^{-3}   58   5155   0.17

Table 1. Results of the dual TIGRA method with $p = 1.2, 1.6$.
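The geometric update of the regularization parameter in the outer iteration can be illustrated with a few lines of Python. This is only a sketch: the function names `alpha_schedule` and `updates_until_below` are hypothetical, the values $\alpha_0 = 10^6$ and $q = 0.7$ are taken from the parameter description above, and the threshold $10^{-2}$ is an arbitrary illustrative choice, not a quantity from the paper.

```python
def alpha_schedule(alpha0, q, num_updates):
    """Return [alpha_0, ..., alpha_n] with alpha_{j+1} = q * alpha_j."""
    if not 0.0 < q < 1.0:
        raise ValueError("update factor q must lie in (0, 1)")
    alphas = [alpha0]
    for _ in range(num_updates):
        alphas.append(q * alphas[-1])
    return alphas


def updates_until_below(alpha0, q, threshold):
    """Count how many updates alpha -> q * alpha are needed to drop below threshold."""
    alpha, count = alpha0, 0
    while alpha >= threshold:
        alpha *= q
        count += 1
    return count


# Assumed values from the text: alpha_0 = 1e6, q = 0.7.
alphas = alpha_schedule(1e6, 0.7, 50)
print(alphas[-1])                            # about 1.8e-2 after 50 updates
print(updates_until_below(1e6, 0.7, 1e-2))   # illustrative threshold only
```

Since $\alpha_j = q^j \alpha_0$ decays geometrically, roughly fifty outer steps already reduce $\alpha$ by eight orders of magnitude, which is why a factor $q$ fairly close to 1 remains affordable.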

For the stopping rule we used $C_{\alpha_j} = 1.5\,\alpha_j$. In addition, we stopped each inner loop after at most 3000 iterations. Finally, for the discrepancy principle we chose $\tau = 2$. To verify the algorithm's global convergence behaviour, we have considered various randomly chosen initial guesses $x_0$ of different size $\|x_0\|$. Some reconstructions for different values of $p$ and $\delta$ are shown in Figure 1. Further results are summarized in Table 1, where $j$ denotes the number of outer iterations and $k$ the total number of inner iterations combined for all values of $j$. Moreover, $e_k = \|x_{j,k} - x^\dagger\| / \|x^\dagger\|$ denotes the relative error corresponding to the final approximation $x_{j,k}$.

Figure 2. Reconstructions obtained by the modified Landweber method with $\|x_0\| = 1$ and 5% (top), 1% (middle), 0.5% (bottom) noise. Panels from left to right: $p = 1.2$, $p = 1.6$, $p = 2$.

5.2. Results for the dual modified Landweber method. In this example, we have adapted an algorithm suggested in [3] for the Hilbert space situation $p = 2$ to $\ell^p$ with $1 < p \le 2$. In the present notation the resulting dual modified Landweber algorithm reads
(5.3)  $J_p(x_{k+1}) = J_p(x_k) - \beta_k \nabla\Phi_{\alpha_k}(x_k), \qquad x_{k+1} = J_q(J_p(x_{k+1})).$
In [3] no step-size was used, i.e. $\beta_k = 1$ for all $k$, but we found that using $\beta_k$ as in (5.2) significantly improved the performance. Also the choice of the regularization parameters $\alpha_k$ was slightly adapted by introducing an additional factor $\|x_0\|$,
$$\alpha_k = \|x_0\|\,(k + 1000)^{-0.99}.$$
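A single step of the dual iteration (5.3) can be sketched in Python as follows. The sketch assumes finitely many coefficients stored in a list, uses the componentwise duality mapping $J_p(x)_i = \mathrm{sign}(x_i)|x_i|^{p-1}$ from the text together with the conjugate exponent $q = p/(p-1)$, and takes the gradient as a precomputed input. The function names are hypothetical and not part of the authors' implementation.

```python
import math


def conjugate(p):
    """Conjugate exponent q with 1/p + 1/q = 1 (requires p > 1)."""
    return p / (p - 1.0)


def duality_map(x, p):
    """Componentwise duality mapping J_p(x)_i = sign(x_i) * |x_i|**(p - 1)."""
    return [math.copysign(abs(xi) ** (p - 1.0), xi) for xi in x]


def dual_step(x, grad, beta, p):
    """One step: J_p(x_new) = J_p(x) - beta * grad, then x_new = J_q(J_p(x_new))."""
    dual = [a - beta * g for a, g in zip(duality_map(x, p), grad)]
    return duality_map(dual, conjugate(p))


def alpha_k(norm_x0, k):
    """Regularization parameter alpha_k = ||x_0|| * (k + 1000)**(-0.99)."""
    return norm_x0 * (k + 1000.0) ** (-0.99)


x = [3.0, -1.0, 0.0, 0.5]
# J_q inverts J_p because (p - 1) * (q - 1) = 1.
roundtrip = duality_map(duality_map(x, 1.2), conjugate(1.2))
# For p = 2 both mappings are the identity and the step is plain gradient descent.
hilbert_step = dual_step([1.0, 2.0], [0.5, -1.0], 0.1, 2.0)
print(roundtrip, hilbert_step, alpha_k(1.0, 0))
```

The update is carried out in the dual space and mapped back with $J_q$, so for $p = 2$ the scheme reduces to the classical Hilbert space iteration.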

In particular for larger values of $\|x_0\|$, omitting this factor results in under-regularized solutions. Some reconstructions obtained with the dual modified Landweber method are shown in Figure 2 and further results are summarized in Table 2. As above, $k$ denotes the total number of iterations until the discrepancy principle is satisfied with $\tau = 2$ and $e_k$ the relative error in the final approximation. Missing values in the table indicate that either the discrepancy principle was not fulfilled after $10^5$ iterations or that the algorithm stopped in zero. In our experiments with $\|x_0\| = 10^4$ the dual modified Landweber method never produced an approximation satisfying the discrepancy principle and the corresponding lines are thus omitted in Table 2.

δ  ‖x_0‖ | p = 1.2: α_k  k  e_k | p = 1.6: α_k  k  e_k
5% 1 1. 10 3 4.4 1.1 10 3 4 0.83
5% 500 3.1 10 1 1380 0.36 3. 10 1 1305 0.34
5% 1000 3. 10 1 3749 0.34
1% 1 8.6 10 4 69 1.6 6.6 10 4 134 0.59
1% 500 1.4 10 5937 0.1 1.8 10 46437 0.
1% 1000 1.8 10 96354 0.
0.5% 1.6 10 4 4933 1. 1.9 10 4 709 0.48
0.5% 500 5.6 10 3 160831 0.11 6.4 10 3 141508 0.11
0.5% 1000

Table 2. Results of the mod. Landweber method with $p = 1.2, 1.6$.

Comparing the results of Subsections 5.1 and 5.2, we note that in our experiments with small starting value ($\|x_0\| = 1$) the modified Landweber method tends to achieve the prescribed accuracy in the data misfit after fewer iterations than the dual TIGRA method. The reconstructions in Figure 1 and Figure 2 show, however, that this increase in speed comes at the cost of a lower reconstruction quality. As the initial value gets larger, the dual TIGRA method outperforms the modified Landweber method and proves to be robust with respect to $\|x_0\|$.

6. Proofs of selected results

We collect here the proofs of several of the results presented throughout the previous sections in order to improve the flow of reading there. To lighten our formulae, we recall the shorthand notation $R_{\Phi_\alpha}(z,x)$ from (3.2)
and additionally introduce
(6.1)  $R_F(z,x) := F(z) - F(x) - F'(x)(z - x)$ for $x, z \in X$.
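For the autoconvolution operator of Section 5 the remainder (6.1) can be evaluated explicitly, which the following Python sketch illustrates. It assumes a simple rectangle-rule discretization on a uniform grid (not necessarily the discretization used in the experiments), and the function names are hypothetical. Because the operator is quadratic, its derivative at $f$ in direction $h$ is $2\,[f * h]$, and the remainder $R_G(z,x)$ coincides with the autoconvolution of the difference $z - x$.

```python
def autoconv(f, step):
    """Rectangle-rule sketch of G(f)(s_i) = int_0^{s_i} f(s_i - t) f(t) dt."""
    return [step * sum(f[i - j] * f[j] for j in range(i + 1))
            for i in range(len(f))]


def autoconv_derivative(f, h, step):
    """Directional derivative G'(f)h = 2 [f * h] of the quadratic map G."""
    return [2.0 * step * sum(f[i - j] * h[j] for j in range(i + 1))
            for i in range(len(f))]


def taylor_remainder(z, x, step):
    """R_G(z, x) = G(z) - G(x) - G'(x)(z - x), cf. (6.1)."""
    diff = [zi - xi for zi, xi in zip(z, x)]
    g_z = autoconv(z, step)
    g_x = autoconv(x, step)
    lin = autoconv_derivative(x, diff, step)
    return [a - b - c for a, b, c in zip(g_z, g_x, lin)]


n = 8
step = 1.0 / n
x = [1.0] * n
z = [1.0 + 0.1 * i for i in range(n)]
remainder = taylor_remainder(z, x, step)
quadratic_part = autoconv([zi - xi for zi, xi in zip(z, x)], step)
print(max(abs(a - b) for a, b in zip(remainder, quadratic_part)))
```

In this discrete setting the remainder is therefore quadratic in $z - x$, and the same computation shows that $f$ and $-f$ produce identical data.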

0 WEI WANG, STEPHAN W. ANZENGRUBER, RONNY RAMLAU, AND BO HAN 6.1. Proof of Proposition.9. First, observe that 1 F(xδ ᾱ ) F(xδ α ) = 1 F(xδ ᾱ) y δ 1 F(xδ α) y δ + F(x δ α) F(x δ ᾱ),f(x δ α) y δ. Then, using that x δ ᾱ is a minimizer of Φᾱ(x), we find 1 F(xδ ᾱ) F(x δ α) +ᾱd fp (x δ ᾱ,x δ α) = Φᾱ(x δ ᾱ) Φᾱ(x δ α)+ F(x δ α) F(x δ ᾱ),f(x δ α) y δ ᾱ J p (x δ α),x δ ᾱ x δ α F(x δ α ) F(xδ ᾱ ),F(xδ α ) yδ ᾱ J p (x δ α ),xδ ᾱ xδ α. On the other hand, x δ α is a minimizer of Φ α(x) and the first order optimality condition reads Φ α (x δ α) = F (x δ α) (F(x δ α) y δ )+αj p (x δ α) = 0. Using the resulting expression for J p (x δ α) as well as Assumption.5 (ii) and Lemma.8 we further obtain 1 F(xδ ᾱ ) F(xδ α ) +ᾱd fp (x δ ᾱ,xδ α ) F(x δ α ) F(xδ ᾱ ),F(xδ α ) yδ + ᾱ F (x δ α α ) (F(x δ α ) yδ ),x δ ᾱ α xδ = R F (x δ ᾱ,x δ α),f(x δ α) y δ +(1 ᾱ α ) F (x δ α)(x δ α x δ ᾱ),f(x δ α) y δ cd fp (x δ ᾱ,xδ α ) F(xδ α ) yδ +(1 ᾱ α )K xδ α xδ ᾱ F(xδ α ) yδ. Moreover, for α α Theorem.7 and Assumption.5 (iii) yield whence it follows that F(x δ α ) yδ ω α+δ s α, 1 F(xδ ᾱ) F(x δ α) +ᾱd fp (x δ ᾱ,x δ α) sc αd fp (x δ ᾱ,x δ α)+s K(α ᾱ) x δ α x δ ᾱ. Using α ᾱ q 0 = 1+sc sc ᾱ we find 1 (6.) F(xδ ᾱ ) F(xδ α ) + 1 sc ᾱd fp (x δ ᾱ,xδ α ) s K(α ᾱ) xδ α xδ ᾱ. Finally, due to Lemma.8 we may apply (.4) in Lemma.3 with constant c p = c A = p 1 (3A)p and since ᾱ α we obtain x δ α xδ ᾱ 1 c A D fp (x δ ᾱ,xδ α ) s K c A (1 sc )α (α ᾱ) xδ α xδ ᾱ, which proves (.15). Thus, continuity from the right of the mapping α x δ α readily follows and with (6.) also of the mapping α F(x δ α ).

GLOBAL MINIMIZATION OF TIKHONOV FUNCTIONALS 1 6.. Proof of Theorem 3.. Let α α and h X with h = 1 be fixed. For t 1,t R, we define x i = x δ α + t ih, i = 1,, and t = t t 1. Note that Φ α (x) = F (x) (F(x) y δ )+αj p (x) and recall that Y is a Hilbert space. One easily verifies, Φ α (x ) = 1 F(x 1)+ t F (x 1 )h+r F (x,x 1 ) y δ +αf p (x ) = Φ α (x 1 ) t Φ α (x 1 ),h + 1 t F (x 1 )h + 1 R F(x,x 1 ) y δ F(x 1 ),R F (x,x 1 ) + t F (x 1 )h,r F (x,x 1 ) +αd fp (x,x 1 ). Thus, R Φα (x,x 1 ) = 1 t F (x 1 )h + 1 R F(x,x 1 ) + F(x 1 ) y δ,r F (x,x 1 ) + t F (x 1 )h,r F (x,x 1 ) +αd fp (x,x 1 ) and it remains to establish bounds for the terms which may be negative. Due to Assumption.5 (i) and Lemma.8, we have F (x 1 )h,r F (x,x 1 ) ( F (x 1 ) F (x δ α ) + F (x δ α ) ) cd fp (x,x 1 ) (L t 1 +K)cD fp (x,x 1 ). and, using also Assumption.5 (ii) and (.8), F(x 1 ) y δ,r F (x,x 1 ) ( R F (x 1,x δ α ) + F (x δ α ) t 1 + F(x δ α ) yδ ) R F (x,x 1 ) (s α+k t 1 + L ) t 1 cd fp (x,x 1 ). Collecting these estimates we obtain ( R Φα (x,x 1 ) (1 sc )α c(k +L t ) 1 ) t 1 c(k +L t 1 ) t D fp (x,x 1 ). For t, t 1 r α, we have t r α. Thus, using γ = 1 sc, R Φα (x,x 1 ) γαd fp (x,x 1 ) p(r α )D fp (x,x 1 ), where p(r) := 5cL r 3cKr +γα has the zeros r 1, (α) = 3K ± 9K +10(L/c)γα. 5L

Now, the smaller of the two zeros in absolute value satisfies
$$\min\{|r_1|, |r_2|\} = \frac{\sqrt{9K^2 + 10(L/c)\gamma\alpha} - 3K}{5L} = \frac{2\gamma\alpha}{c\bigl(\sqrt{9K^2 + 10(L/c)\gamma\alpha} + 3K\bigr)} \ge \begin{cases} \dfrac{2}{1+\sqrt{2}}\,\dfrac{\gamma\alpha}{3cK} & \text{if } 10L\gamma\alpha \le 9cK^2, \\[2ex] \dfrac{2}{1+\sqrt{2}}\,\sqrt{\dfrac{\gamma\alpha}{10cL}} & \text{otherwise} \end{cases}$$
$$\ge \frac{2}{1+\sqrt{2}} \min\Bigl\{ \frac{\gamma\alpha}{3cK},\ \sqrt{\frac{\gamma\alpha}{10cL}} \Bigr\} = r_\alpha.$$
Thus, $R_{\Phi_\alpha}(x_2,x_1) \ge \gamma\alpha D_{f_p}(x_2,x_1) \ge 0$ holds for $|t_2|, |t_1| \le r_\alpha$ and according to Proposition 3.1 the function $\varphi_{\alpha,h}(t)$ is then convex on $[-r_\alpha, r_\alpha]$, which completes the proof.

6.3. Proof of Lemma 3.5. By virtue of Lemma 2.3 we may estimate the Bregman distance $D_{f_p}(z,x)$ in terms of the norm $\|x - z\|_{\ell^p}$ not only from above, but also from below. However, the lower bound involves a number $c_p$ which in fact depends on the elements $x$, $z$ under consideration or, more precisely, on estimates of the size of $\|x\|$ and on the distance $\|x - z\|$. In this respect the following coercivity result turns out to be particularly useful.

Lemma 6.1. Let $z \in \ell^p$ be arbitrary but fixed, then to every $d > 0$ there exists a constant $C_d > \|z\|_{\ell^p}$ such that, if $D_{f_p}(z,x) \le d$ holds for $x \in \ell^p$, then $\|x\| \le C_d$. Moreover, the numbers $C_d$ may be chosen such that they depend on $d$ continuously and are strictly monotonically increasing on $(\|z\|_{\ell^p}, \infty)$.

Proof. Suppose that $x \in \ell^p$ satisfies $D_{f_p}(z,x) \le d$. Using the first identity in (2.2) and $\|J_p(x)\| = \|x\|^{p-1}$ we obtain
$$0 \le \frac{1}{q}\|x\|^p + \frac{1}{p}\|z\|^p - \|x\|^{p-1}\|z\| \le D_{f_p}(z,x) \le d.$$
Note that the function $f(t) := \frac{1}{q} t^p - \|z\|\, t^{p-1} + \frac{1}{p}\|z\|^p$ attains its minimum $f(t_0) = 0$ at $t_0 = \|z\|$ and that it is continuous and strictly increasing on $[t_0, \infty)$ with $f(t) \to \infty$ as $t \to \infty$. Thus, for every $d > 0$ there exists a number $C_d > t_0 = \|z\|$ such that $f(t) > d$ for all $t > C_d$. Clearly this choice of $C_d$ is continuous and strictly increasing and the proof is complete.

Remark 6.2. From the proof of Lemma 6.1 it follows that the number $C_d$ only depends on $\|z\|$ (not $z$ itself) and that it increases with $\|z\|$. Thus, for each $R > 0$, $C_d = C_d(R)$ may be chosen uniformly for all $\|z\| \le R$.

Proof of Lemma 3.5.
If $d$ is any positive number and $x \in \ell^p$ satisfies $D_{f_p}(x_\alpha^\delta, x) \le d$, then with $C_d$ from Lemma 6.1 and $A$ from Lemma 2.8 we obtain $\|x\| \le C_d$ and $\|x - x_\alpha^\delta\| \le C_d + A$. Note that here the numbers $C_d$ are assumed to be independent

GLOBAL MINIMIZATION OF TIKHONOV FUNCTIONALS 3 of α α (cf. Remark 6.). Thus, we may apply the lower bound for the Bregman distance from Lemma.3, x x δ α c 1 p,d D f p (x δ α,x) dc 1 p,d, with constant c p,d := p 1 (A+C d) p. Observe that, due to the monotonicity and continuity of C d in Lemma 6.1, the quantity dc 1 p,d is monotonically increasing and continuous, and that dc 1 p,d 0 as d +0. This readily yields the existence of d α > 0 such that D fp (x δ α,x) d α implies x x δ α d α c 1 p,d α = r α, and d α follows from r α as α. 6.4. Proof of Theorem 3.7. Throughout this subsection we assume the outer iteration index j to be fixed. To shorten our formulae we neglect the dependence of the iterates x j,k, the step-sizes β j,k and the regularization parameter α j on j, and simply write x k, β k and α instead. The following Proposition already asserts that the Bregman distance between the minimizer x δ α and the iterates x k decays monotonically if each step-size β k remains below the threshold β k. However, these upper bounds β k are given in terms of the searched-for minimizers x δ α. This dependence is then removed in Theorem 3.7 with the choice (3.9). Proposition 6.3. Let Assumption.5 hold and suppose that D fp (x δ α,x k) d α with d α as in Lemma 3.5 and that Φ α (x k ) > 0. Then the next iterate x k+1 from (3.6) is closer to x δ α than x k with respect to the Bregman distance, i.e. D fp (x δ α,x k+1) D fp (x δ α,x k), if the step-size satisfies β k I k := (0, β k ]. Here (6.3) βk := Φ α(x k ),x k x δ α c q (α) Φ α (x k ), where the constant c q (α) is given by { (6.4) c q (α) := max 1, q 1 ) } q ((r α +A) p 1 +r α with r α and A from Theorem 3. and Lemma.8, respectively. Proof. By the definition of Bregman distance D fp (z,x) and the dual iteration (3.6), we have D fp (x δ α,x k+1) D fp (x δ α,x k) = 1 p x k p 1 p x k+1 p J p (x k+1 ) J p (x k ),x δ α x k = D fp (x k,x k+1 ) β k Φ α (x k ),x k x δ α. As in (.) we rewrite the first term in the dual space, i.e. 
$$D_{f_p}(x_k, x_{k+1}) = D_{f_q}(J_p(x_k), J_p(x_{k+1})).$$

4 WEI WANG, STEPHAN W. ANZENGRUBER, RONNY RAMLAU, AND BO HAN According to Lemma 3.5 we have x k x δ α r α due to D fp (x δ α,x k) d α and, consequently, β k β r α k Φ α (x k ) holds for β k from (6.3). Thus, J p (x k ) = x k p 1 (r α +A) p 1 J p (x k+1 ) J p (x k ) +β k Φ α (x k ) (r α +A) p 1 +r α and we may apply (.7) with constant c q = c q (α) from (6.4) to obtain (6.5) Altogether we have shown D fq (J p (x k ),J p (x k+1 )) c q (α) J p (x k+1 ) J p (x k ) = c q (α) β k Φ α(x k ). D fp (x δ α,x k+1) D fp (x δ α,x k) g(β k ), where g(β k ) := c q (α)βk Φ α (x k ) β k Φ α (x k ),x k x δ α satisfies g(β k ) < 0 for small values of β k as according to Proposition 3.3 we have Φ α (x k ),x k x δ α = ϕ 1 (0) > 0. Now, g(β k) is minimized if c q (α) β k = Φ α(x k ),x k x δ α Φ α (x k ) and hence x k+1 is closer to x δ α than x k with respect to the Bregman distance for β k (0, β k ]. Proof of Theorem 3.7. The idea of the proof is to establish (6.6) Φ α (x k ),x k x δ α γαd f p (x δ α,x k) γα c k, so that, consequently, the right hand side in (3.7) is a lower bound for β 0 in (6.3) that is independent of x δ α. Once this is verified, we may appeal to Proposition 6.3 to obtain the assertion. The left hand side inequality in (6.6) has already been proven in Proposition 3.3 and it remains to show D fp (x δ α,x k ) c k. Note that, if D fp (x δ α,x k) 1 this is evidently true. Therefore, we only consider the case D fp (x δ α,x k ) < 1 from here on. Now, we use the minimizing property of x δ α and Φ α(x k ) 0 to obtain 0 < Φ α (x k ) φ j,k Φ α (x k ) Φ α (x δ α ). To bound the right hand side in terms of D fp (x k,x δ α) we note that Φ α (x k ) = 1 F(x k) F(x δ α ),F(x k) y δ + 1 F(xδ α ) yδ,f(x k ) y δ +αf p (x k ) Φ α (x δ α ) = 1 F(xδ α ) F(x k),f(x δ α ) yδ + 1 F(x k) y δ,f(x δ α ) yδ +αf p (x δ α ) and, for all x l p, Φ α (x),x δ α x k = F(x) y δ,f (x)(x δ α x k) +α f p (x),xδ α x k.

GLOBAL MINIMIZATION OF TIKHONOV FUNCTIONALS 5 Due to Φ α (x δ α ) = 0 for the minimizer xδ α, the latter yields Φ α (x k ) Φ α (x δ α) = R Φα (x δ α,x k ) = 1 F(x k) F(x δ α),f(x k )+F(x δ α) y δ + F(x δ α) y δ,f (x δ α)(x δ α x k ) +αd fp (x k,x δ α) = 1 F(x k) F(x δ α) + F(x δ α) y δ,r F (x k,x δ α) +αd fp (x k,x δ α) = 1 F (x δ α )(xδ α x k) + F(x δ α ) yδ,r F (x k,x δ α ) +αd f p (x k,x δ α ) F (x δ α )(xδ α x k),r F (x k,x δ α ) + 1 R F(x k,x δ α ). The Bregman distance is not symmetric and to derive the required estimates, we will bound D fp (x k,x δ α) in terms of D fp (x δ α,x k ). Using x δ α A according to Lemma.8 and x k x δ α r α, this is possible as D fp (x k,x δ α ) c p x δ α x k p and x δ α x k 1 c A D fp (x δ α,x k) follows from Lemma.3 with c A := p 1 (r α+a) p. In combination with the estimates for R F (x k,x δ α) and F (x δ α) in Remark.6 and Lemma.8, respectively, we thus obtain by applying Cauchy s inequality Φ α (x k ) φ j,k 1 c A ( K +L F(x δ α ) yδ +LKr α + L r α 4 +α c p ) D fp (x δ α,x k) c p/ A D fp (x δ α,x k ) p/. Finally, for α α as in (.10) we know from (.11) that F(x δ α ) yδ s α and as we only consider the case D fp (x δ α,x k ) < 1, Φ α (x k ) φ j,k 1 ( ) K +Ls α+lkr α + L rα +α c p c p A D 1/ f c A 4 p (x δ α,x k ) holds. Collecting the above estimates, we have established D fp (x k,x δ α ) c k. Hence, (6.6) holds true and by virtue of the inital arguments and Proposition 6.3, the proof is complete. 6.5. Proof of Proposition 3.8. As in the previous section, we simply write x k, β k and α neglecting the dependence on j, which is assumed to be fixed. Lemma 6.4. If Assumption.5 holds, then to each α > α there exists κ α > 0 such that Φ α (x) κ α holds for all x X satisfying D fp (x δ α,x) d α with d α from Lemma 3.5.

6 WEI WANG, STEPHAN W. ANZENGRUBER, RONNY RAMLAU, AND BO HAN Proof. Recall that Φ α (x) = F (x) (F(x) y δ )+αj p (x) and that x x δ α r α holds with r α as defined in (3.3) due to Lemma 3.5. Using Assumption.5, Theorem.7 as well as Lemma.8, we obtain F (x) F (x) F (x δ α) + F (x δ α) Lr α +K F(x) y δ R F (x δ α,x) + F (x)(x δ α x) + F(xδ α ) yδ cd fp (x δ α,x)+(lr α +K)r α +s α x r α +A, where R F (x δ α,x) is as defined in (6.1). Collecting the above estimates, it thus follows that Φ α (x) F (x) F(x) y δ +α x p 1 κ α, where κ α := (Lr α +K) ( cd α +(Lr α +K)r α +s α ) +α ( r α +A ) p 1 and the proof is complete. In preparation for the proof of Proposition 3.8 we need another result for the dual gradient descent method. Proposition 6.5. Let Assumption.5 hold and suppose that D fp (x δ α,x k) d α holds with d α from Lemma 3.5. Then there exists a constant M α > 0, such that for all β [0,T α ], where r α T α := c q (α)c α with r α, c q (α) and C α as in (3.3), (6.4) and (3.13), respectively, x k (β) defined by satisfies J p (x k (β)) = J p (x k ) β Φ α (x k ), x k (β) = J q (J p (x k (β))) R Φα (x k (β),x k ) M α D fp (x k (β),x k ). Proof. Arguing in a similar way as in the proof of Theorem 3., we get R Φα (x k (β),x k ) = 1 R F(x k (β),x k )+F (x k )(x k (β) x k ) R F (x k (β),x k ),F(x k ) y δ ) +αd fp (x k (β),x k ), with R F and R Φ as defined in (6.1) and (3.), respectively. To apply (.4) in Lemma.3, we use x k (β) = J p (x k (β)) p 1 and (a + b) p 1 a p 1 + b p 1 for a,b 0, p (1,] as well as Lemmata.8, 3.5 and 6.4 to estimate x k x k x δ α + xδ α r α +A ( ) p 1 x k (β) J p (x k ) +β Φ α (x k ) rα +A+(T α κ α ) p 1 x k (β) x k (r α +A)+(T α κ α ) p 1.