DIPLOMA THESIS. Bc. Kamil Vasilík Linear Error-In-Variable Modeling


Charles University in Prague
Faculty of Mathematics and Physics

DIPLOMA THESIS

Bc. Kamil Vasilík
Linear Error-In-Variable Modeling

Department of Numerical Mathematics
Supervisor: RNDr. Iveta Hnětynková, PhD.
Study programme: Mathematics, Numerical and Computational Mathematics

I would like to thank my supervisor RNDr. Iveta Hnětynková, PhD. for her careful guidance and for valuable advice and help with writing this thesis. I thank my family and friends for their motivation and support.

I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources. I understand that my work relates to the rights and obligations under Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.

In Prague, Kamil Vasilík

Contents

Introduction

1 Theoretical Background
  1.1 Linear Error-In-Variables Models
  1.2 Regularization
  1.3 Golub-Kahan Iterative Bidiagonalization
  1.4 Golub-Kahan Bidiagonalization and Riemann-Stieltjes Integral
  1.5 Revealing of Noise
  1.6 Determination of the Noise Level

2 Point of Noise Revealing
  2.1 Stagnation-Based Criterion
  2.2 Closest-to-Origin Criterion
  2.3 Minimum-Point Criterion
  2.4 Further Suggestions

3 Different Colors of Noise
  3.1 Basic Concepts
      White noise
      Color of noise
  3.2 Noise Propagation

4 Loss of Orthogonality and Noise Revealing
  4.1 Noise Propagation
  4.2 Noise Revealing Criteria

Conclusion

Bibliography

Title: Lineárne algebraické modelovanie úloh s nepresnými dátami (Linear algebraic modelling of problems with inexact data)
Author: Bc. Kamil Vasilík
Department: Department of Numerical Mathematics
Supervisor: RNDr. Iveta Hnětynková, PhD.
Supervisor's e-mail address: hnetynkova@cs.cas.cz
Abstract: In the present work we consider problems Ax ≈ b arising from the discretization of ill-posed problems, where the right-hand side b contains (unknown) noise. It is shown in [29] that, under certain natural assumptions, the size of the noise level can be estimated at a negligible cost using the Golub-Kahan iterative bidiagonalization. Such information can further be used in solving ill-posed problems. In this work we propose criteria for detecting the noise revealing iteration in the Golub-Kahan iterative bidiagonalization. We discuss the presence of noise of different colors. We study how the loss of orthogonality affects the noise revealing property of the bidiagonalization.
Keywords: ill-posed problems, regularization, the Golub-Kahan iterative bidiagonalization, the Lanczos tridiagonalization, noise propagation, loss of orthogonality

Title: Linear Error-In-Variable Modeling
Author: Bc. Kamil Vasilík
Department: Department of Numerical Mathematics
Supervisor: RNDr. Iveta Hnětynková, PhD.
Supervisor's e-mail address: hnetynkova@cs.cas.cz
Abstract: In this thesis we consider problems Ax ≈ b arising from the discretization of ill-posed problems, where the right-hand side b is polluted by (unknown) noise. It was shown in [29] that, under some natural assumptions, the noise level in the data can be estimated at a negligible cost using the Golub-Kahan iterative bidiagonalization. Such information can be further used in solving ill-posed problems. Here we suggest criteria for detecting the noise revealing iteration in the Golub-Kahan iterative bidiagonalization. We discuss the presence of noise of different colors. We study how the loss of orthogonality affects the noise revealing property of the bidiagonalization.
Keywords: ill-posed problems, regularization, the Golub-Kahan iterative bidiagonalization, the Lanczos tridiagonalization, noise propagation, loss of orthogonality

Introduction

Vision is the art of seeing what is invisible to others. Jonathan Swift

In many fields of application, e.g. computed tomography, signal and image processing, seismology, geology, etc., there arises a need to solve linear approximation problems

    Ax ≈ b,   A ∈ R^{n×m},   x ∈ R^m,   b ∈ R^n,

that result from the discretization of ill-posed problems. These problems are difficult to solve, because usually a small perturbation of the right-hand side b, which represents the observations, causes a large change in the solution x, which stands for the unknown real data. The matrix A, which represents a discretized model, is usually ill-conditioned; the singular values of A decay gradually without a noticeable gap and/or most of them are very small. Often the right-hand side b is polluted by some kind of unknown noise b^{noise} of different origin (blur in image processing, noise in signal processing, effects of rounding errors, etc.). Formally, we write

    b = b^{exact} + b^{noise} ∈ R^n,

where only b is known.

In computations, noise is often an unwanted phenomenon. Noise can be coarse or soft, and it is often described in terms of colors (white, pink, red, blue, violet, grey, etc.). The most important characteristic of noise is its unpredictability. Real noise cannot be fully predicted, but it can be simulated with various mathematical formulas. This gives us a chance to study the effect of noise in numerical computations.

Least Squares methods [9, 27,, 53] often fail in this case, because the obtained solution is dominated by amplified noise [7, 25]. It is necessary to use regularization techniques [, 8,, 5, 3, 43, 47] in order to find a proper

approximation of the solution. Regularization methods based on the singular value decomposition, e.g. Truncated Least Squares [, 53] or Tikhonov regularization [9, 8], are usually used for smaller problems. Projection regularization methods, suitable also for large scale problems, are typically based on Krylov subspace iterations, where regularization is achieved by projecting the original problem onto a Krylov subspace of smaller dimension, e.g. LSQR [43, 44], CGLS [26], CGNE [5], and [33, 52]. Many of these methods are based on the Golub-Kahan iterative bidiagonalization [3]. Hybrid methods combine iterative and direct regularization. Here the original problem is first projected onto a Krylov subspace and then some direct regularization is applied to the projected problem, see e.g. [2, 3, 8, 5, 32].

The amount of regularization in every regularization method is controlled by a regularization parameter. Methods for choosing regularization parameters can be divided into two groups. In the first group there are methods based on a priori knowledge of the size of the noise in the data, e.g. the Discrepancy Principle [7, 39, 4]. The second group contains methods that work without any a priori information, e.g. the L-curve [35, chapter 26] and [7, 8, 22, 32], Generalized Cross Validation [32, 54], the NCP method [5, 5, 24], etc.

In [29], it was shown that the Golub-Kahan iterative bidiagonalization can be used to estimate the (unknown) noise level in the data very cheaply. In order to get a good estimate it is necessary to identify the so-called noise revealing iteration. The obtained approximation of the noise level can further be used in the construction of stopping criteria in regularization methods based on the Golub-Kahan iterative bidiagonalization.

The main goal of this thesis is to study the noise revealing property of the Golub-Kahan iterative bidiagonalization. We suggest automated noise revealing criteria for the Golub-Kahan iterative bidiagonalization, in order to approximate the noise level. In the literature, e.g. [23, 24, 29, 5, 5], noise present in the right-hand side is mostly assumed to be white, or it is assumed that it can be whitened. We discuss how noise of different colors propagates in the Golub-Kahan iterative bidiagonalization. We look at the effects of the loss of orthogonality on the noise revealing property of the bidiagonalization when it is computed without reorthogonalization.

Chapter 1

Theoretical Background

It is the theory that decides what can be observed. Albert Einstein

1.1 Linear Error-In-Variables Models

Consider a system of linear algebraic equations

    Ax ≈ b,   A ∈ R^{n×m},   b ∈ R^n,   (1.1)

where A is a nonzero matrix and b is a nonzero vector. We assume that A^T b ≠ 0, otherwise b cannot be approximated by the columns of A and the problem (1.1) has only the trivial solution x = 0. Further we assume that both A and b are real, but the extension to the complex case is possible.

Such linear approximation models come from a wide class of applications. Often the matrix A stands for a discretized smoothing operator, the vector b represents observed data and the vector x is the (unknown) real solution to the problem. In many cases the matrix A may be ill conditioned, i.e. the ratio of the largest and the smallest singular value is enormous. Also the system (1.1) may be ill-posed. At the beginning of the 20th century Jacques Hadamard came up with the ideas of well-posed and ill-posed problems, see [7, 25] for references. He defined a problem as ill-posed if its solution is nonunique, or if a small perturbation of the observed data causes damaging changes to the solution, so that the solution does not depend continuously on the data.

If the numerical rank of A is well defined and there exists a noticeable gap

among large and small singular values, the problem (1.1) is called rank-deficient, [7, 25]. In this case one or more columns (or rows) of A are almost numerically dependent on the others, therefore the system contains redundant information that can unfavourably affect the dependence of the solution on the right-hand side. In many cases the numerical rank of A is not well defined and the singular values decay to zero without a noticeable gap. Both cases are illustrated in Figure 1.1.

Figure 1.1: The matrix A of the problem shaw() (left) from Regularization Tools [2] has a well defined numerical rank due to the good separation of large and small singular values; the singular values of the matrix A of parallax() (right) decay without any noticeable gap.

Usually, the right-hand side b is polluted by some kind of unknown noise which can occur as a result of blur (in image processing) or physical noise (in signal processing), or any kind of unpredicted circumstances which happen during the recording of the data. Noise can also result from rounding errors that occur during numerical computations due to the already mentioned ill-posedness of the system. We can therefore write the right-hand side as

    b = b^{exact} + b^{noise} ∈ R^n,   (1.2)

where b^{exact} is the exact data and b^{noise} is the noisy part of the right-hand side; however, we know only the vector b. We assume the noise present in b to be sufficiently small compared to the exact data, ‖b^{noise}‖ ≪ ‖b^{exact}‖, where ‖·‖ is the standard Euclidean norm. Let us call the ratio δ_{noise} ≡ ‖b^{noise}‖ / ‖b^{exact}‖ the noise level. Usually we will assume b^{noise} to be white noise,

however in Chapter 3 we also study the presence of different colors of noise in the right-hand side. Noise is called white if it has uncorrelated components, zero mean and constant standard deviation, see e.g. [24]. The Fourier coefficients of the white noise b^{noise} have comparable size, see also Section 3.2.

Properties of the problem

For simplicity of exposition, throughout the whole chapter we consider A ∈ R^{n×n} to be a nonsingular square matrix (i.e. n = m), although it is possible to extend the results to a general rectangular matrix, n ≠ m. Therefore, instead of (1.1), we actually solve

    Ax = b,   b = b^{exact} + b^{noise},   (1.3)

where b^{noise} is white noise, unless we specify otherwise.

Consider the singular value decomposition (SVD)

    A = U Σ V^T = \sum_{j=1}^{n} σ_j u_j v_j^T,   (1.4)

where U = [u_1, ..., u_n], U^{-1} = U^T, V = [v_1, ..., v_n], V^{-1} = V^T, their columns are the left and right singular vectors of A, and Σ is a diagonal matrix with the singular values σ_1 ≥ ... ≥ σ_n > 0 of A on the diagonal. Using the notation from (1.4), we get the solution x as

    x = A^{-1} b = V Σ^{-1} U^T b = \sum_{i=1}^{n} \frac{u_i^T b}{σ_i} v_i.   (1.5)

We will see that the ill-posed character of the problem (1.3) and the division by the smallest singular values in (1.5) cause the solution x to be unsatisfactory.

Naive solution

Let us look at the following example from image deblurring. Consider a black-and-white picture of length p and width q. It can be represented by a matrix of dimensions p and q, where every pixel has a value

in the range from 0 (black) to 255 (white). Let b_i ∈ R^p, i = 1, ..., q, be the vector containing the values of the i-th column of the observed picture; then b = (b_1^T, ..., b_q^T)^T represents the vectorized form of the observed data. Working with color pictures is described e.g. in Deblurring Images [6, chapter 2]. The optical system used to obtain the picture b can be represented by the mapping operator A. We assume that this operator is well described, so the matrix A does not contain any errors. We can write our problem as (1.3), where A ∈ R^{pq×pq}, b ∈ R^{pq} represents the observed data and x ∈ R^{pq} represents the vectorized solution of (1.3). Figure 1.2 (left) shows an example of observed data, the blurry image.

Figure 1.2: Blurry right-hand side b polluted by noise (left) and the naive solution (right).

Such blur can occur for example from an out-of-focus lens of the optical system, a motion of the observed object, a heat wave of air, etc. If we try to solve the corresponding problem by computing the solution (1.5), we obtain what can be seen in Figure 1.2 (right). The presence of noise b^{noise} in the right-hand side b causes the solution (1.5) to be fully undecipherable. This is the reason why this solution is called naive. Using the SVD of A, the formula (1.5) can be written as

    x^{naive} = A^{-1} (b^{exact} + b^{noise}) = \sum_{i=1}^{n} \frac{u_i^T b^{exact}}{σ_i} v_i + \sum_{i=1}^{n} \frac{u_i^T b^{noise}}{σ_i} v_i,   (1.6)

where the first sum represents the exact solution x^{exact} = A^{-1} b^{exact} and the second sum represents the noisy component x^{noise} = A^{-1} b^{noise}. The vector

b^{exact} usually satisfies the discrete Picard condition, see [7, 3], i.e. the absolute values of the projections of the exact right-hand side b^{exact} onto the left singular vectors u_i decay, on average, faster than the singular values σ_i corresponding to these singular vectors. On the other hand, the noise b^{noise} present in the right-hand side does not satisfy this condition. The size of the components u_i^T b^{noise} is comparable for all i (recall that we assume white noise); none of the projections is noticeably greater than the others. Since we divide these projections by the singular values, the components u_i^T b^{noise} / σ_i corresponding to small singular values become large and the second sum becomes dominant over the first sum in the solution x^{naive}, therefore

    ‖x^{exact}‖ = ‖A^{-1} b^{exact}‖ ≪ ‖A^{-1} b^{noise}‖.   (1.7)

In Figure 1.3 (left), we can see that the projections u_i^T b (red dots) decay as fast as the singular values of A (blue line), until they start to fluctuate at the noise level δ_{noise} = ‖b^{noise}‖ / ‖b^{exact}‖ (black line). Therefore, from this point on the useful information in the solution (1.6) is ruined. The projections u_j^T b / σ_j (green dots) show how fast the sizes of the components of the second sum in (1.6) grow.

Figure 1.3: Singular values σ_j of A (blue line), projections |u_j^T b| (red dots), |u_j^T b| / σ_j (green dots) and the noise level δ (black line) (left). Regularized TSVD solution (right).

Figure 1.4 shows similar results for the problem shaw() from [2]. This problem comes from a discretization of the Fredholm integral equation of the first kind with a smooth kernel. The difference between the exact (right) and naive (left) solution is highly noticeable.
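The amplification just described is easy to reproduce numerically. The following Matlab sketch builds a noisy right-hand side with a prescribed noise level and evaluates both sums in (1.6). It assumes that Hansen's Regularization Tools package (providing the test problem shaw) is on the path; the dimension, the random seed and the chosen noise level are only illustrative.

    % A minimal sketch, assuming Regularization Tools is installed.
    n = 400;
    [A, b_exact, x_exact] = shaw(n);          % discrete ill-posed test problem

    delta_noise = 1e-4;                       % prescribed noise level (illustrative)
    rng(0);                                   % fix the random seed
    e = randn(n, 1);
    b_noise = delta_noise * norm(b_exact) * e / norm(e);
    b = b_exact + b_noise;                    % observed right-hand side (1.2)

    [U, S, V] = svd(A);
    sigma = diag(S);

    x_naive = V * ((U' * b) ./ sigma);        % naive solution (1.5)

    % The two sums in (1.6): the noise part dominates by many orders of magnitude.
    norm(V * ((U' * b_exact) ./ sigma))       % ~ norm of the exact solution
    norm(V * ((U' * b_noise) ./ sigma))       % amplified noise component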

Figure 1.4: The naive solution of the problem shaw() (left) and the exact solution (right).

It will be useful to summarize the following properties of the problem (1.3).

Singular vectors

For discrete ill-posed problems the singular vectors u_i and v_i from (1.4) usually tend to have an increasing number of sign changes in their elements as the index i increases. If the singular value σ_i gets smaller, the frequency of the corresponding singular vectors gets higher; for a detailed analysis see [24, 23]. We can see that in Figure 1.5.

Figure 1.5: The first eight left u_i (top) and right v_i (bottom) singular vectors for the problem shaw().

System matrix

Since the matrix A usually comes from the discretization of an integral equation, such as Fredholm integral equations of the first kind, for details see [7], it is natural to assume that A has a smoothing property. For example, imagine A as a numerical model of the lenses of a camera, the solution x as a real image of an object and the right-hand side b as a picture taken by the camera, as in the example presented in Figure 1.2. If the lenses are out of focus, sharp edges of the object become smoother.

Assume b in the form (1.2). The vector b^{exact} results from a smooth integral operator applied to the vector x. Therefore b^{exact} is supposed to be dominated by the low frequency components. In other words, b^{exact} is smooth. If b^{noise} has white noise character, there is no domination in frequencies. Since the u_i have a tendency to oscillate as i increases, for small indices i the spectral components u_i^T b are dominated by the exact part of the right-hand side, while for larger i the spectral components are dominated by u_i^T b^{noise}, see the analysis in [24, 23]. The inverse matrix A^{-1} in (1.5) has a tendency to strengthen the high frequency components of the right-hand side b, which gives a different view on why noise is amplified in (1.5).

1.2 Regularization

We saw that the presence of noise in the right-hand side and the properties of the system (1.3) make it difficult to find a good approximation of the exact solution A^{-1} b^{exact}, and that the system (1.3) cannot be solved directly. Therefore it is necessary to consider proper regularization techniques to solve this kind of problems, see e.g. [, 8,, 5, 3, 43, 47]. Regularization techniques try to find a proper approximation of the solution that is not sensitive to and/or dominated by the errors. Such techniques are either direct (e.g. based on the SVD, like Truncated Least Squares [, 53] or Tikhonov regularization [9, 8], usually used for smaller problems), projection methods (usually used for large scale problems, often based on Krylov subspace iterations [33, 52, 43, 44]), or hybrid methods combining iterative and direct regularization, e.g. [2, 3, 8, 5, 32].

The Truncated Least Squares (T-LS) method [9, 27,, 53], also called Truncated SVD (TSVD), can be considered the simplest regularization technique based on the SVD of A. Here we substitute the original matrix A by

a well-conditioned matrix with a smaller numerical rank, in order to eliminate the components dominated by errors. Let us assume the SVD of A as in (1.4), but with the size of the problem n replaced by the numerical rank r of the matrix A. Then, according to the Eckart-Young-Mirsky theorem [7], the closest approximation of A by a matrix of rank l, 0 < l < r, is the matrix A_l = \sum_{i=1}^{l} u_i σ_i v_i^T. The TSVD solution can then be expressed as

    x_l^{TSVD} = \sum_{i=1}^{l} \frac{u_i^T b}{σ_i} v_i,   (1.8)

where l is called the truncation level and it represents the regularization parameter of the TSVD method.

Although this method is quite simple, it may have good regularization properties in some cases. Let us use this method to solve the problem (1.3), where the right-hand side vector is the image poker.jpeg (Figure 1.6, top left). This image was blurred and polluted by white noise with the noise level 10^{-3} using the package of Matlab routines attached to the book Deblurring Images [6]. As we can see in Figure 1.6, for the truncation level l = 2 (bottom right) the image stays totally unrevealed, as it was in the case of Figure 1.3 (right). This means that the regularization is not sufficient, the solution is under-regularized. The noise present in the image is still strong enough to suppress the solution, therefore there is a need to regularize the problem more. As we reduce the truncation level, the image comes into focus and for approximately l = 7 (middle right) we have the best regularization of the image. As we reduce the truncation level even more, over-regularization sets in, i.e. we have omitted too much information of the original problem and the approximation is not good, as shown for l = 3 (middle left). Therefore choosing a proper regularization parameter is very important.
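Continuing the sketch above (U, sigma, V, b and x_exact are assumed to be available), the TSVD solution (1.8) is obtained by simply truncating the sum; sweeping the truncation level l illustrates the transition between over- and under-regularization. The chosen levels are illustrative only.

    % TSVD solution (1.8): keep only the first l SVD components.
    for l = [3 7 9 20]                        % illustrative truncation levels
        x_l = V(:, 1:l) * ((U(:, 1:l)' * b) ./ sigma(1:l));
        fprintf('l = %2d: residual %.2e, error %.2e\n', ...
                l, norm(A * x_l - b), norm(x_l - x_exact));
    end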

Figure 1.6: Real picture (top left), blurry picture (top right), and the TSVD solution for different truncation levels l: l = 3 (middle left), l = 7 (middle right), l = 9 (bottom left), l = 2 (bottom right).

In the following example we illustrate how the regularization parameter changes with the noise level. Figure 1.7 shows different solutions for different noise levels. In the first plot (top left), there are the singular values σ_i of the problem shaw() and the projections u_i^T b for the noise levels δ_{noise} = ‖b^{noise}‖ / ‖b^{exact}‖ set to 10^{-4}, 10^{-8} and 10^{-14}.

Figure 1.7: Solutions of the problem shaw(), its SVD and the projections u_i^T b for the noise levels 10^{-4}, 10^{-8} and 10^{-14}.

As we can see, for a small noise level (top right) that is close to the machine precision (or close to the smallest singular values of A), there is practically no need to use regularization to get an approximate solution. As the noise level gets higher, regularization becomes necessary. For the noise level δ_{noise} = 10^{-8} (bottom left) the optimal truncation level in TSVD seems to be l = 3, because it is the first index for which the projection u_i^T b drops below the noise level. For the noise level δ_{noise} = 10^{-4} (bottom right) it is obvious that the optimal regularization parameter gets very small and we cannot afford to use more information, because it would destroy the solution. The proper regularization parameter seems to be l = 8 and, since the noise level is quite high, it is very important not to under- or over-regularize the solution.
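The rule used above to read off the truncation level from Figure 1.7 can be automated once an (approximate) noise level is available. A minimal sketch, again reusing U, sigma, V, b and delta_noise from the previous sketches; the normalization of the projections by ‖b‖ is an assumption made here so that they are comparable with δ_{noise}.

    % Choose the truncation level as the last index before the projections
    % |u_i' * b| drop to the noise level delta_noise (known or estimated).
    proj  = abs(U' * b) / norm(b);            % normalized Picard-plot coefficients
    first = find(proj < delta_noise, 1);      % first projection below the noise level
    if isempty(first), first = numel(proj) + 1; end
    l_opt = max(first - 1, 1);                % truncate just above the noise level
    x_reg = V(:, 1:l_opt) * ((U(:, 1:l_opt)' * b) ./ sigma(1:l_opt));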

1.3 Golub-Kahan Iterative Bidiagonalization

One of the most popular Krylov subspace methods with regularization properties is the Golub-Kahan iterative bidiagonalization, e.g. [5, 43, 44]. Given the initial vectors w_0 = 0, s_1 = b/β_1, where β_1 = ‖b‖, the Golub-Kahan iterative bidiagonalization algorithm computes for j = 1, 2, ...

    α_j w_j = A^T s_j − β_j w_{j−1},   ‖w_j‖ = 1,
    β_{j+1} s_{j+1} = A w_j − α_j s_j,   ‖s_{j+1}‖ = 1,   (1.9)

until α_j = 0 or β_{j+1} = 0, or until the process reaches the dimensionality of the problem, in our case n. Let S_k = [s_1, ..., s_k] and W_k = [w_1, ..., w_k] be the matrices with the left and right bidiagonalization vectors as their orthonormal columns, and let the matrices L_k and L_{k+1} have the form

    L_k = \begin{bmatrix} α_1 & & & \\ β_2 & α_2 & & \\ & \ddots & \ddots & \\ & & β_k & α_k \end{bmatrix},
    L_{k+1} = \begin{bmatrix} L_k \\ β_{k+1} e_k^T \end{bmatrix},

where e_k is the k-th vector of the standard Euclidean basis. Then the first k steps of the Golub-Kahan iterative bidiagonalization can be written as

    A^T S_k = W_k L_k^T,   A W_k = [S_k, s_{k+1}] L_{k+1},   (1.10)

see [2, 3, 3]. The columns of W_k form an orthonormal basis of the k-th Krylov subspace K_k(A^T A, A^T b) and the columns of S_k form an orthonormal basis of the k-th Krylov subspace K_k(AA^T, b), [3].

Suppose we seek an approximation to the solution of (1.3) in the Krylov subspace K_k(A^T A, A^T b), i.e. we search for y_k such that A W_k y_k ≈ b, where x_k = W_k y_k. If we wish to minimize the norm of the residual r_k = b − A W_k y_k, then y_k has to satisfy

    ‖L_{k+1} y_k − e_1 β_1‖ = min_y ‖L_{k+1} y − e_1 β_1‖.   (1.11)

In other words, y_k is the Least Squares solution of the projected problem L_{k+1} y_k ≈ β_1 e_1, see [, ]. This corresponds to the LSQR [43, 44] or CGLS [26] methods, which are mathematically equivalent.
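The recurrence (1.9) translates directly into a few lines of Matlab. The sketch below runs k_max steps of the Golub-Kahan iterative bidiagonalization and can optionally apply full reorthogonalization (a simplified substitute for the double reorthogonalization used in the experiments later in this thesis); the function name and interface are only one possible choice.

    function [S, W, alpha, beta] = gk_bidiag(A, b, k_max, reorth)
    % Golub-Kahan iterative bidiagonalization (1.9); minimal sketch.
    [n, m] = size(A);
    S = zeros(n, k_max + 1);   W = zeros(m, k_max);
    alpha = zeros(k_max, 1);   beta = zeros(k_max + 1, 1);

    beta(1) = norm(b);                        % beta_1 = ||b||
    S(:, 1) = b / beta(1);                    % s_1 = b / beta_1
    w_prev  = zeros(m, 1);                    % w_0 = 0

    for j = 1:k_max
        % alpha_j w_j = A' s_j - beta_j w_{j-1}
        w = A' * S(:, j) - beta(j) * w_prev;
        if reorth, w = w - W(:, 1:j-1) * (W(:, 1:j-1)' * w); end
        alpha(j) = norm(w);
        if alpha(j) == 0, break; end
        W(:, j) = w / alpha(j);

        % beta_{j+1} s_{j+1} = A w_j - alpha_j s_j
        s = A * W(:, j) - alpha(j) * S(:, j);
        if reorth, s = s - S(:, 1:j) * (S(:, 1:j)' * s); end
        beta(j + 1) = norm(s);
        if beta(j + 1) == 0, break; end
        S(:, j + 1) = s / beta(j + 1);
        w_prev = W(:, j);
    end
    end

The returned coefficients alpha and beta form the bidiagonal matrices L_k and L_{k+1} from (1.10), which carry all the information needed for the projected problems discussed here and below.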

If the approximation of the solution is based on ensuring the orthogonality of the residual r_k to the Krylov subspace generated by the columns of S_k, then we have

    L_k y_k = β_1 e_1,   (1.12)

which corresponds to the CGNE method [5].

The projection onto the Krylov subspace of dimension k itself represents a form of (outer) regularization, where the dimension k stands for the regularization parameter. For small k there is not enough information in the projected problem to reconstruct the solution (over-regularization). If k gets larger, noise may transfer to the projected problem (under-regularization). Thus it is often reasonable to compute more iterations (use larger k) of the bidiagonalization and then regularize the projected problem L_{k+1} y_k ≈ β_1 e_1 or (1.12) using some inner (usually direct) regularization. Methods combining outer and inner regularization are called hybrid methods, see [2, 3, 8, 5, 32].

Parameter-choice methods

As we saw in Section 1.2, the regularization parameter plays an important role in regularization. Methods for choosing regularization parameters can be divided into two main groups. The first is based on a priori knowledge of the size of the norm of the noise ‖b^{noise}‖, or at least of a good approximation of it. One of the most widespread methods is the Discrepancy Principle, [7, 39, 4]. It chooses the regularization parameter as the one for which the norm of the residual of the regularized solution x^{reg} is close to ‖b^{noise}‖, i.e.

    ‖A x^{reg} − b‖ = τ ‖b^{noise}‖,   (1.13)

where τ > 1 is some real number. For a discrete problem solved by TSVD, the proper regularization parameter is usually chosen as the smallest l such that ‖A x_l^{reg} − b‖ ≤ τ ‖b^{noise}‖.

The second group of parameter-choice methods contains those methods that do not need any a priori information. The L-curve criterion, see [35, chapter 26] and [7, 8, 22, 32], is one of the most convenient graphical tools for the analysis of discrete ill-posed problems. The L-curve plots, for all regularization parameters, the norm of the regularized solution ‖x^{reg}‖_2 versus the corresponding residual norm ‖A x^{reg} − b‖_2 in the log-log scale to emphasize the L-shape of the curve. The proper regularization parameter is the one

for which these norms are balanced, i.e. the point which corresponds to the corner of the curve. The Generalized Cross Validation [32, 54] minimizes a functional connected with a certain property of the residual. The NCP method is based on the normalized cumulative periodogram (NCP), see [5, 5, 24]. The key idea of the NCP method is to choose a regularization parameter l for which the residual vector r_l = b − A x_l^{reg} changes from being dominated by the remaining signal to being white-noise-like (under the assumption that the right-hand side is contaminated by white noise).

1.4 Golub-Kahan Bidiagonalization and the Riemann-Stieltjes Integral

It was shown in [34, 45] that the Golub-Kahan bidiagonalization is closely connected to the Lanczos tridiagonalization (or Lanczos three-term recurrence). The Lanczos tridiagonalization of the symmetric matrix AA^T with the starting vector s_1 = b/β_1, β_1 = ‖b‖, computes after k steps a matrix T_k such that

    AA^T S_k = S_k T_k + α_k β_{k+1} s_{k+1} e_k^T,   (1.14)

where T_k is a symmetric tridiagonal matrix with positive elements on the subdiagonals (a Jacobi matrix),

    T_k = L_k L_k^T = \begin{bmatrix} α_1^2 & α_1 β_2 & & \\ α_1 β_2 & α_2^2 + β_2^2 & \ddots & \\ & \ddots & \ddots & α_{k-1} β_k \\ & & α_{k-1} β_k & α_k^2 + β_k^2 \end{bmatrix}.   (1.15)

As we can see, the matrix L_k from the Golub-Kahan bidiagonalization is a Cholesky factor of the matrix T_k, as further explained in [28, 45, 46].

Let us consider the singular value decomposition (SVD) of the bidiagonal matrix L_k,

    L_k = P_k Θ_k Q_k^T,   (1.16)

where P_k^{-1} = P_k^T, Q_k^{-1} = Q_k^T, and Θ_k is a diagonal matrix with the singular values θ_1^{(k)}, ..., θ_k^{(k)} of L_k on its diagonal, ordered increasingly. This ordering comes from the standard ordering of Ritz values (eigenvalues of the

projected matrix) in the Lanczos method, [34, 37]. From (1.15) it follows that

    T_k ≡ L_k L_k^T = P_k Θ_k^2 P_k^T   (1.17)

is the spectral decomposition of T_k, (θ_l^{(k)})^2 are its eigenvalues and p_l^{(k)} = P_k e_l are its eigenvectors, l = 1, ..., k. Further, using the SVD of A as in (1.4),

    AA^T = U Σ^2 U^T   (1.18)

is the spectral decomposition of the matrix AA^T, σ_j^2 are its nonzero eigenvalues and u_j are the corresponding eigenvectors, j = 1, ..., n.

Relations to orthonormal polynomials

In this section we fully follow the important theoretical interpretations and results from the paper of Meurant and Strakoš [37].

Figure 1.8: Random piecewise constant distribution function with nodes λ_1, ..., λ_N and weights ω_1, ..., ω_N.

The Riemann-Stieltjes integral of a continuous function f(λ) defined on a closed interval [a, b], where a < λ_1, λ_N < b, satisfies

    \int_a^b f(λ) dω(λ) = \sum_{j=1}^{N} ω_j f(λ_j),   (1.19)

23 where ω(λ) is a piecewise constant distribution function, λ j are N points of increase and ω j, j =,...,N are weights that satisfy if λ < a, k ω(λ) = j= ω j if λ j λ < λ j+, N j= ω j = if b λ, see [6]. An example of a piecewise constant distribution function is shown in Figure.8. We can construct a set of N + orthogonal polynomials φ,φ,...,φ N with respect to the Riemann-Stieltjes integral, b a φ i (λ)φ j (λ)dω(λ) =, i j, (.2) where the polynomial φ i is of degree i, see [26, 42]. In order to obtain such orthogonal polynomials, we can use Lanczos three-term recurrence. Lanczos basis vectors s j, j =,...,k can be rewritten using (.4) as s j+ = ψ j (AA T )s, (.2) where ψ j (λ) are the so-called Lanczos polynomials. These polynomials are monic and satisfy the three-term recurrence ψ, ψ =, β j+ ψ j+ (λ) = (λ α j )ψ j (λ) β j ψ j (λ) (.22) for j =,...,k which is basically nothing else than the Lanczos recurrence. From (.8) we have ψ j (AA T )s = Uψ j (Σ 2 )(U T s ) = n (s,u l )ψ(σl 2 )u l. (.23) l= We know that vectors s j are orthonormal, (s i+,s j+ ) = for i j. It can be shown that the Lanczos polynomials are orthogonal with respect to the scalar product defined by the Riemann-Stieltjes integral (ψ i,ψ j ) = σ 2 σn 2 ψ i (λ)ψ j (λ)dω(λ) = n (s,u l ) 2 ψ i (σl 2 )ψ j (σl 2 ), (.24) l= 23

24 where the distribution function ω is a non-decreasing piecewise constant function with at most n points of increase σn,...,σ 2. 2 Let for simplicity σn 2 <... < σ, 2 i.e. the eigenvalues of AA T are distinct. Then we can say about the weights ω l that if λ < σn, 2 ω(λ) = ω n + ω n ω i if σi+ 2 λ < σi 2, n l= ω l = if σ 2 λ, where ω l = (s,u l ) 2 is the squared component of the starting vector s in the direction of the lth invariant subspace of AA T. Furthermore, let us consider the matrix T k, the symmetric tridiagonal matrix defined in (.4) and (.5). As we know, it stores the coefficients of the Lanczos tridiagonalization process applied to the matrix AA T with the starting vector s. The same matrix T k can be obtained as a result of the Lanczos tridiagonalization of the matrix T k with the starting vector e (the vector of the standard Euclidean basis), T k I = IT k +. We can construct Lanczos polynomials similarly as above, using (.7) we get k e j+ = ϕ j (T k )e = P k ϕ j (Θ 2 k)(p T k e ) = (p (k) l,e )ϕ j ((θ (k) l ) 2 )p (k) l. (.25) We know that the vectors of the Euclidean basis are orthonormal with respect to the scalar product defined by the Riemann-Stieltjes integral (ϕ i,ϕ j ) k = = (θ (k) k )2 l= ϕ i (λ)ϕ j (λ)dω (k) (λ) = (θ (k) )2 k (p (k) l,e ) 2 ϕ i ((θ (k) l ) 2 )ϕ j ((θ (k) l ) 2 ), (.26) l= where the distribution function ω (k) is a non-decreasing piecewise constant function with k points of increase (θ (k) ) 2,...,(θ (k) k )2 and with the weights ω (k) l that satisfy if λ < (θ (k) ω (k) ) 2, (λ) = i l= ω(k) l if (θ (k) i ) 2 λ < (θ (k) i+ )2, n l= ω(k) l = if (θ (k) k )2 λ, 24

where ω_l^{(k)} = (p_l^{(k)}, e_1)^2 is the squared first component of the normalized eigenvector of T_k.

The inner products defined by (1.24) and (1.26) have the same moments up to the degree 2k − 1 (in our case), see [9]. For polynomials η of degree at most 2k − 1 it holds that (1, η)_k = (1, η). This statement comes from the connection to the Gauss-Christoffel quadrature and the theory of moments, see [55, 37].

According to Theorem 2.1 in [37], and summarizing the results above, the Lanczos tridiagonalization process (1.14) generates at each step k the non-decreasing piecewise constant distribution function ω^{(k)} with the nodes (θ_l^{(k)})^2 and the weights (p_l^{(k)}, e_1)^2 that approximates the distribution function ω with the nodes σ_n^2, ..., σ_1^2 and the weights (b/β_1, u_n)^2, ..., (b/β_1, u_1)^2, see [37, 9, 26, 34]. This approximation property is further used for the determination of the noise level in Section 1.5.

1.5 Revealing of Noise

As we mentioned in Section 1.3, the Golub-Kahan iterative bidiagonalization has regularization properties. We will follow the paper [29] to describe what happens during the bidiagonalization. Consider the vectors w_k and s_k generated by the Golub-Kahan iterative bidiagonalization algorithm (1.9). The starting vector s_1 = b/‖b‖ is contaminated by noise. Let us look at how the subsequent vectors s_k behave. In Figure 1.9, there are the vectors s_k, k = 1, ..., 12, for the problem baart(4) from [2], computed by the bidiagonalization with double reorthogonalization. In Section 4.1 we will look at s_k computed by the bidiagonalization without reorthogonalization. As we can see, noise is not visible during the first steps of the bidiagonalization, until it is strongly amplified in s_8; then it partially filters out and it appears again in s_12. For indices k > 12 the vectors s_k strongly oscillate. In [29] it was explained how noise is transferred during the bidiagonalization process.

Figure 1.9: The first twelve vectors s_k from the bidiagonalization for the problem baart(4).

The first step of the bidiagonalization generates the vector s_2 as follows:

    β_2 α_1 s_2 = α_1 (A w_1 − α_1 s_1) = AA^T s_1 − α_1^2 s_1.   (1.27)

We know that A, respectively AA^T, has a smoothing property, thus the first term AA^T s_1 is smooth. However, the second term is the vector s_1 contaminated by noise, multiplied by a scalar coefficient. Therefore the contamination of s_1 by the high frequency part of noise is transferred to s_2 with multiplication by a scalar coefficient, while a portion of the smooth part of s_1 is subtracted by the orthogonalization of s_2 against s_1. The relative level of the high frequency part of noise in s_2 can be expected to be higher than in s_1. Analogously, the transfer of the high frequency part of the noise can be observed for any k, because the vector s_{k+1} is obtained from AA^T s_k through the orthogonalization against the vectors s_k and s_{k−1} with subsequent normalization. Consequently, the relative level of the high frequency part of noise gradually increases until the low frequency information is projected out. This happened for our example in iteration 8.

Figure 1.10: The first twelve vectors w_k from the bidiagonalization for the problem baart(4).

In Figure 1.10 the vectors w_k, k = 1, ..., 12, are illustrated for the same problem baart(4). Since A has a smoothing property, the recurrence for the vectors w_k in (1.9) starts with the vector w_1 = A^T s_1 / ‖A^T s_1‖, which is smoothed. Consequently, all vectors w_k are smoothed, and compared to the vectors s_k they do not contain any significant information about noise.

Let us look closely at white noise amplification. Consider a decomposition of the vector s_1 into the exact component s_1^{exact} and the noise component s_1^{noise}, so that s_1 = s_1^{exact} + s_1^{noise}. The Golub-Kahan iterative bidiagonalization (1.9) gives us

    β_2 s_2 = A w_1 − α_1 s_1 = A w_1 − α_1 (s_1^{exact} + s_1^{noise}),   (1.28)

where A w_1 is smooth and contains low frequency components of noise which are relatively negligible compared to the low frequency components of the exact data. From (1.28) we can define s_k^{exact} and s_k^{noise} for k = 1, 2, ... as

    β_{k+1} s_{k+1}^{exact} = A w_k − α_k s_k^{exact},
    β_{k+1} s_{k+1}^{noise} = −α_k s_k^{noise},   (1.29)

so that β_{k+1} (s_{k+1}^{exact} + s_{k+1}^{noise}) = A w_k − α_k (s_k^{exact} + s_k^{noise}).

Note that s_k^{exact} and s_k^{noise} do not represent the true exact and noise components of s_k. If we suppose that multiplication by AA^T significantly smooths the high frequency components of s_k, then, considering s_k^{exact} and s_k^{noise} (approximately) orthogonal, we can expect that ‖s_k^{exact}‖ is close to the norm of the true data component and ‖s_k^{noise}‖ is close to the norm of the true noise component of s_k. From (1.29),

    ‖s_{k+1}^{noise}‖ = \frac{α_k}{β_{k+1}} ‖s_k^{noise}‖,   (1.30)

i.e. noise is amplified approximately with the ratio α_k / β_{k+1}, [29]. This amplification can be illustrated by computing the spectral coefficients of s_k with respect to the orthonormal left singular vectors u_1, ..., u_n and the spectral coefficients of w_k with respect to the right singular vectors v_1, ..., v_n of the matrix A.

Figure 1.11: The first fifteen spectral coefficients U^T s_k (left) and V^T w_k (right), k = 1, ..., 8, for the problem baart(4).

In Figure 1.11 there are the first 15 spectral coefficients U^T s_k and V^T w_k, k = 1, ..., 8. Let us discuss their behavior using the analysis of the Golub-Kahan bidiagonalization process (1.9) applied to these spectral components. For the first step of the bidiagonalization, k = 1, we have

    α_1 (V^T w_1) = Σ (U^T s_1),   (1.31)
    β_2 (U^T s_2) = Σ (V^T w_1) − α_1 (U^T s_1),   (1.32)

and from (1.31) we can see that (V^T w_1) is dominated by the same components as (U^T s_1), see the plots corresponding to these vectors in Figure 1.11.

In the following half-step (1.32), the scaled projection Σ(V^T w_1) is orthogonalized against U^T s_1. This orthogonalization requires that the dominance in Σ(V^T w_1) and U^T s_1 is cancelled out in order to obtain the orthogonality U^T s_2 ⊥ U^T s_1. If the dominance is significant, we can expect β_2 ≪ α_1. Therefore the ratio α_1/β_2 is likely to be significantly larger than one; for a detailed analysis see [29]. A similar study can be done for a general step k = 2, 3, ...

    α_k (V^T w_k) = Σ (U^T s_k) − β_k (V^T w_{k−1}),   (1.33)
    β_{k+1} (U^T s_{k+1}) = Σ (V^T w_k) − α_k (U^T s_k).   (1.34)

From (1.33), since the dominance in Σ(U^T s_k) and (V^T w_{k−1}) is shifted by one component, see the corresponding vectors in Figure 1.11, we cannot expect a significant cancellation and therefore α_k ≈ β_k. On the other hand, the vectors Σ(V^T w_k) and (U^T s_k) display dominance in the same components, see Figure 1.11. If this dominance is strong enough, then the required orthogonality between s_{k+1} and s_k, or U^T s_{k+1} ⊥ U^T s_k in (1.34), cannot be achieved without significant cancellation, so we can expect β_{k+1} ≪ α_k, see [29]. According to (1.30), noise is amplified with a ratio much greater than one. Summarizing,

    s_{k+1}^{noise} = −\frac{α_k}{β_{k+1}} s_k^{noise} = (−1)^k ρ_k s_1^{noise},   (1.35)

where the cumulative amplification ratio

    ρ_k = \prod_{j=1}^{k} \frac{α_j}{β_{j+1}}   (1.36)

on average rapidly grows as k increases. The values of the normalization coefficients α_k, β_{k+1} of the Golub-Kahan iterative bidiagonalization are illustrated in Figure 1.12. Their inverse cumulated ratio ρ_k^{-1} decreases; thus ρ_k rapidly grows, until it reaches a certain level where it starts to stagnate.
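With alpha and beta from the bidiagonalization sketch of Section 1.3, the cumulated amplification ratio (1.36) and its inverse, as plotted in Figure 1.12, are obtained directly (assuming the bidiagonalization ran the full number of steps, so that no coefficient is zero):

    % Cumulated amplification ratio (1.36) and its inverse.
    rho     = cumprod(alpha ./ beta(2:end));  % rho_k = prod_{j=1..k} alpha_j / beta_{j+1}
    rho_inv = 1 ./ rho;                       % decreases and then levels off
    semilogy(1:numel(rho), rho, 'o-', 1:numel(rho), rho_inv, 's-');
    legend('\rho_k', '\rho_k^{-1}');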

Figure 1.12: The first twenty normalization coefficients α_k, β_{k+1}, their cumulated ratio ρ_k and its inverse ρ_k^{-1} for the problems baart(4) (left) and shaw(4) (right), computed by the Golub-Kahan bidiagonalization with double reorthogonalization.

Let us look at several spectral components of s_k, following the analysis in [29]. In Figure 1.13, there are the absolute values of the projections of the vectors s_k, k = 1, ..., 10, onto the leading left singular vectors of the problem baart(4). The first vector s_1 has a dominant component in the direction of the left singular vector u_1, and the rest of the projections U^T s_1 decay faster than the associated singular values, due to the discrete Picard condition. Then, the vectors s_2, s_3, ... have dominant components in the directions of the left singular vectors u_2, u_3, ..., respectively, and the relative level of their high frequency noise components gradually increases. At some point k ≡ k_{noise}, in our case k = 7, the following vector s_{k_{noise}+1}, here s_8, has comparable components in practically all subspaces generated by the singular vectors corresponding to the singular values σ_{k_{noise}+1}^2, σ_{k_{noise}+2}^2, ..., and that reveals the noise. In the following vector s_{k_{noise}+2}, in our case s_9, the high frequency components slightly decrease, due to the fact that noise is partially projected out. A similar illustration can be seen for the problem shaw(4) with the noise level 10^{-4} in Figure 1.14. Since s_8 has comparable components in practically all subspaces generated by the singular vectors corresponding to the singular values σ_8^2, σ_9^2, ..., the noise revealing iteration seems to be k_{noise} = 7.

Figure 1.13: The absolute values of the spectral components of the vectors s_k, k = 1, ..., 10, for the problem baart(4), computed by the Golub-Kahan bidiagonalization with double reorthogonalization.

Figure 1.14: The absolute values of the spectral components of the vectors s_k, k = 1, ..., 10, for the problem shaw(4) with the noise level 10^{-4}, computed by the Golub-Kahan bidiagonalization with double reorthogonalization.

The spectral analysis of the vectors coming from the Golub-Kahan iterative bidiagonalization demonstrated how noise is transferred and amplified during the bidiagonalization process. These results can be used to estimate the noise level in the original data.

1.6 Determination of the Noise Level

In this section we further follow the theoretical interpretation and results from the paper [29]. As we saw in Section 1.4, the Golub-Kahan iterative bidiagonalization and the Lanczos tridiagonalization process are closely connected via the approximation of the distribution function. This property can be used for the determination of the noise level in the data without any additional computation of the spectral or any other coefficients of s_k.

Repeating again, the Lanczos tridiagonalization process (1.14) generates at each step k the distribution function ω^{(k)} with the nodes (θ_l^{(k)})^2 and the weights (p_l^{(k)}, e_1)^2, l = 1, ..., k, that approximates the distribution function ω with the nodes σ_n^2, ..., σ_1^2 and the weights (b/β_1, u_n)^2, ..., (b/β_1, u_1)^2, see [26, 37, 34, 55]. As illustrated in Figure 1.15 (left), and also in Figures 1.3 and 1.7, the weights belonging to the smaller nodes of the distribution function ω are completely polluted by noise. Thus there exists an index J_{noise} such that for j ≥ J_{noise}

    (b/β_1, u_j)^2 ≈ (b^{noise}/β_1, u_j)^2.

Let us define the cumulated weight of the associated nodes as

    δ^2 ≡ \sum_{j=J_{noise}}^{n} (b^{noise}/β_1, u_j)^2.   (1.37)

Figure 1.15: The projections (b/β_1, u_j)^2 and (b^{noise}/β_1, u_j)^2 (left), and δ^2 from (1.37) as a function of J_{noise} (right), for the problem shaw(4) with the noise level δ_{noise} = 10^{-4}.

33 In Figure.5 (right) we can see the behavior of δ 2 as a function of J noise =,...,5. As we can see, the weight of the associated set of nodes for j > J noise decreases very slowly. Since b noise b exact, we can approximately write δ 2 noise = bnoise 2 b exact 2 bnoise 2 b 2 = n (b noise /β,u j ) 2. (.38) For white noise contaminated discrete ill-posed problems we can usually suppose that J noise n yielding j= δ 2 n J noise δnoise 2 δ 2 n noise. (.39) Let us focus again on the distribution functions ω and ω (k). For ω, its larger nodes σ,σ 2 2,... 2 are well separated, relatively to small ones, and their weights (s,u j ) 2 on average decrease faster than the singular values σ,σ 2 2,..., 2 due to the discrete Picard condition. Because of the dominance of weights (s,u j ) 2 corresponding to the large nodes σ,σ 2 2,... 2 (small j) of ω, see Figure.2, the large squared Ritz values (eigenvalues of L k ) (θ (k) k )2, (θ (k) k )2,... (large k) closely approximate σ,σ 2,..., see Section.4 and [26, 37, 55]. As k increases, the smallest nodes σj 2 are approximated by the smallest (θ (k) ) 2. As stated in [9], at any iteration step the weight (p (k),e ) 2 corresponding to the smallest Ritz value (θ (k) ) 2 must be larger than the sum of the weights of all σj 2 smaller than (θ (k) ) 2. At some point, the smallest Ritz value (θ (k) ) 2 approximates, or becomes smaller than, the node σj 2 noise and since δ 2 satisfies (.37), the weight corresponding to (θ (k) ) 2 will become (p (k),e ) 2 δ 2 δ 2 noise. (.4) Then, since δ 2, as a function J noise, (almost) stagnates, the weights corresponding to the following iterations also start to stagnate. And this happens only after all smooth components of s with norms larger than the noise level are damped at the iteration k noise defined in Section.5. In this notation the relation (.4) holds in the iteration k noise +. 33

Figure 1.16: The smallest node (θ_1^{(k)})^2 and its weight (p_1^{(k)}, e_1)^2, together with the distribution function ω(λ) (left), and the absolute values of the first components (p_1^{(k)}, e_1) of the left singular vector of L_k corresponding to the smallest singular value θ_1^{(k)} (right), for the problems baart(4) (top) and shaw(4) with the noise level 10^{-4} (bottom).

In Figure 1.16 (left plots) there are plots of the distribution function ω and the approximations of its smallest nodes and the corresponding weights for the problem baart(4) (top) and the problem shaw(4) with the noise level 10^{-4} (bottom). As we can see, the weight (p_1^{(8)}, e_1)^2 corresponding to the node (θ_1^{(8)})^2 reaches the noise level. The weights obtained in the following iterations stay approximately on the same level. The same behavior of the weights, or rather of their square roots (p_1^{(k)}, e_1), can be seen in the right plots of Figure 1.16. After the iteration k = 7, the sequence starts to (almost) stagnate at a level close to the noise level δ_{noise}. This corresponds to the iteration k_{noise} = 7 in Figures 1.13 and 1.14.

Summarizing, we see that it is sufficient to monitor the absolute value of the first component (p_1^{(k)}, e_1) of the left singular vector of the bidiagonal matrix L_k, see (1.16), which corresponds to the smallest Ritz value θ_1^{(k)}. The point where it starts to (almost) stagnate gives the approximation of the noise level,

    δ_{noise} ≈ |(p_1^{(k_{noise}+1)}, e_1)|.   (1.41)

The last iteration before this happens is called the noise revealing iteration k_{noise}. Such a point can be determined by an automated procedure. Such procedures are suggested and studied in Chapter 2. Recall that this estimate of the noise level can be used for the construction of efficient stopping criteria, e.g. based on the Discrepancy Principle, see Section 1.3.
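In computations, (1.41) only requires the SVD of the small bidiagonal matrix L_k at every step. A minimal sketch, reusing alpha and beta from the bidiagonalization sketch of Section 1.3 (detecting the stagnation point itself is the subject of Chapter 2):

    % |(p_1^(k), e_1)|: first component of the left singular vector of L_k
    % corresponding to its smallest singular value theta_1^(k), see (1.16).
    k_max = numel(alpha);
    p1e1  = zeros(k_max, 1);
    for k = 1:k_max
        Lk = diag(alpha(1:k));
        if k > 1
            Lk = Lk + diag(beta(2:k), -1);    % subdiagonal beta_2, ..., beta_k
        end
        [P, ~, ~] = svd(Lk);                  % singular values in decreasing order,
        p1e1(k) = abs(P(1, end));             % so the last column belongs to theta_1^(k)
    end
    semilogy(1:k_max, p1e1, 'o-');            % stagnates near the noise level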

Chapter 2

Point of Noise Revealing

Only a man's character is the real criterion of worth. Eleanor Roosevelt

In this chapter we suggest several approaches to determine the noise revealing iteration in the Golub-Kahan iterative bidiagonalization, in order to estimate the noise level. We look for the point where the sequence (p_1^{(k)}, e_1) starts to (almost) stagnate (the graph has a "corner"). We denote this iteration by k_{stag}. Then, according to the notation in the previous chapter, the noise revealing iteration k_{noise} is equal to k_{stag} − 1. The following criteria try to determine the iteration k_{stag} automatically. The experiments presented in this chapter were computed with double reorthogonalization. The effects of losing orthogonality in the Golub-Kahan iterative bidiagonalization will be discussed later in Chapter 4.

2.1 Stagnation-Based Criterion

The first noise revealing iteration criterion was suggested in [29]. As shown in Section 1.6, the sequence of weights (p_1^{(k)}, e_1)^2, after a certain iteration step, starts to (almost) stagnate at a level close to the squared noise level δ_{noise}^2. There is no harm in working only with the square roots of the weights, (p_1^{(k)}, e_1). Therefore the criterion is based simply on monitoring the point of the sequence (p_1^{(k)}, e_1) where it first starts to stagnate. The first iteration

step k for which the sequence (p_1^{(k)}, e_1) satisfies the relation

    (p_1^{(k+1)}, e_1) − (p_1^{(k+1+step)}, e_1) < ζ [ (p_1^{(k)}, e_1) − (p_1^{(k+1)}, e_1) ],   (2.1)

is considered k_{stag}, [29]. The values step and ζ from (2.1) are free parameters. The parameter step controls how far horizontally we look. The parameter ζ stands for a vertical "drop" of the values. In [29] they are set to step = 3 and ζ = 0.5. The implementation in Matlab is straightforward.

Table 2.1: The approximation of k_{stag} and the estimate of the noise level for the problems ilaplace(,), baart(4), shaw(4) from [2] and gravity(4) from [2] for several noise levels δ_{noise}. Iterations in brackets are the iterations where the stagnation exactly starts (determined visually).

Let us show on some ill-posed test problems how this criterion works. In Table 2.1 there are several problems with a couple of noise levels and the obtained estimates for the parameters step = 3 and ζ = 0.5. The estimate of the noise level by the stagnation-based criterion works well, although the noise revealing iteration is not always right. The values of the iterations in brackets stand for the point where the sequence (p_1^{(k)}, e_1) starts to (almost) stagnate; they are determined visually just by looking at the sequence.
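A possible Matlab implementation of a stagnation test of the form (2.1) is sketched below. It is a sketch under the assumption that (2.1) compares the drop of the sequence over the next step iterations with ζ times the preceding one-step drop; the defaults step = 3 and ζ = 0.5 follow the text, and p1e1 denotes the sequence |(p_1^{(k)}, e_1)| computed as in Section 1.6.

    function k_stag = stagnation_criterion(p1e1, step, zeta)
    % Stagnation-based criterion, a minimal sketch of the test (2.1):
    % the first k whose drop over the next `step' iterations is smaller
    % than zeta times the preceding one-step drop is returned as k_stag.
    if nargin < 2, step = 3;   end
    if nargin < 3, zeta = 0.5; end
    k_stag = [];                              % empty if no such k is found
    for k = 1:numel(p1e1) - step - 1
        drop_next = p1e1(k + 1) - p1e1(k + 1 + step);
        drop_prev = p1e1(k) - p1e1(k + 1);
        if drop_next < zeta * drop_prev
            k_stag = k;
            return;
        end
    end
    end

A typical call is k_stag = stagnation_criterion(p1e1), using the default parameters.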


More information

Multi-Linear Mappings, SVD, HOSVD, and the Numerical Solution of Ill-Conditioned Tensor Least Squares Problems

Multi-Linear Mappings, SVD, HOSVD, and the Numerical Solution of Ill-Conditioned Tensor Least Squares Problems Multi-Linear Mappings, SVD, HOSVD, and the Numerical Solution of Ill-Conditioned Tensor Least Squares Problems Lars Eldén Department of Mathematics, Linköping University 1 April 2005 ERCIM April 2005 Multi-Linear

More information

8 The SVD Applied to Signal and Image Deblurring

8 The SVD Applied to Signal and Image Deblurring 8 The SVD Applied to Signal and Image Deblurring We will discuss the restoration of one-dimensional signals and two-dimensional gray-scale images that have been contaminated by blur and noise. After an

More information

Linear Inverse Problems

Linear Inverse Problems Linear Inverse Problems Ajinkya Kadu Utrecht University, The Netherlands February 26, 2018 Outline Introduction Least-squares Reconstruction Methods Examples Summary Introduction 2 What are inverse problems?

More information

5 Linear Algebra and Inverse Problem

5 Linear Algebra and Inverse Problem 5 Linear Algebra and Inverse Problem 5.1 Introduction Direct problem ( Forward problem) is to find field quantities satisfying Governing equations, Boundary conditions, Initial conditions. The direct problem

More information

Block Bidiagonal Decomposition and Least Squares Problems

Block Bidiagonal Decomposition and Least Squares Problems Block Bidiagonal Decomposition and Least Squares Problems Åke Björck Department of Mathematics Linköping University Perspectives in Numerical Analysis, Helsinki, May 27 29, 2008 Outline Bidiagonal Decomposition

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

LARGE SPARSE EIGENVALUE PROBLEMS

LARGE SPARSE EIGENVALUE PROBLEMS LARGE SPARSE EIGENVALUE PROBLEMS Projection methods The subspace iteration Krylov subspace methods: Arnoldi and Lanczos Golub-Kahan-Lanczos bidiagonalization 14-1 General Tools for Solving Large Eigen-Problems

More information

6 The SVD Applied to Signal and Image Deblurring

6 The SVD Applied to Signal and Image Deblurring 6 The SVD Applied to Signal and Image Deblurring We will discuss the restoration of one-dimensional signals and two-dimensional gray-scale images that have been contaminated by blur and noise. After an

More information

Maths for Signals and Systems Linear Algebra in Engineering

Maths for Signals and Systems Linear Algebra in Engineering Maths for Signals and Systems Linear Algebra in Engineering Lectures 13 15, Tuesday 8 th and Friday 11 th November 016 DR TANIA STATHAKI READER (ASSOCIATE PROFFESOR) IN SIGNAL PROCESSING IMPERIAL COLLEGE

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 The Singular Value Decomposition (SVD) continued Linear Algebra Methods for Data Mining, Spring 2007, University

More information

Statistical Geometry Processing Winter Semester 2011/2012

Statistical Geometry Processing Winter Semester 2011/2012 Statistical Geometry Processing Winter Semester 2011/2012 Linear Algebra, Function Spaces & Inverse Problems Vector and Function Spaces 3 Vectors vectors are arrows in space classically: 2 or 3 dim. Euclidian

More information

Mathematical Beer Goggles or The Mathematics of Image Processing

Mathematical Beer Goggles or The Mathematics of Image Processing How Mathematical Beer Goggles or The Mathematics of Image Processing Department of Mathematical Sciences University of Bath Postgraduate Seminar Series University of Bath 12th February 2008 1 How 2 How

More information

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit II: Numerical Linear Algebra Lecturer: Dr. David Knezevic Unit II: Numerical Linear Algebra Chapter II.3: QR Factorization, SVD 2 / 66 QR Factorization 3 / 66 QR Factorization

More information

Preconditioning. Noisy, Ill-Conditioned Linear Systems

Preconditioning. Noisy, Ill-Conditioned Linear Systems Preconditioning Noisy, Ill-Conditioned Linear Systems James G. Nagy Emory University Atlanta, GA Outline 1. The Basic Problem 2. Regularization / Iterative Methods 3. Preconditioning 4. Example: Image

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our

More information

Old and new parameter choice rules for discrete ill-posed problems

Old and new parameter choice rules for discrete ill-posed problems Old and new parameter choice rules for discrete ill-posed problems Lothar Reichel Giuseppe Rodriguez Dedicated to Claude Brezinski and Sebastiano Seatzu on the Occasion of Their 70th Birthdays. Abstract

More information

LARGE SPARSE EIGENVALUE PROBLEMS. General Tools for Solving Large Eigen-Problems

LARGE SPARSE EIGENVALUE PROBLEMS. General Tools for Solving Large Eigen-Problems LARGE SPARSE EIGENVALUE PROBLEMS Projection methods The subspace iteration Krylov subspace methods: Arnoldi and Lanczos Golub-Kahan-Lanczos bidiagonalization General Tools for Solving Large Eigen-Problems

More information

Greedy Tikhonov regularization for large linear ill-posed problems

Greedy Tikhonov regularization for large linear ill-posed problems International Journal of Computer Mathematics Vol. 00, No. 00, Month 200x, 1 20 Greedy Tikhonov regularization for large linear ill-posed problems L. Reichel, H. Sadok, and A. Shyshkov (Received 00 Month

More information

Preconditioning. Noisy, Ill-Conditioned Linear Systems

Preconditioning. Noisy, Ill-Conditioned Linear Systems Preconditioning Noisy, Ill-Conditioned Linear Systems James G. Nagy Emory University Atlanta, GA Outline 1. The Basic Problem 2. Regularization / Iterative Methods 3. Preconditioning 4. Example: Image

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

Tikhonov Regularization of Large Symmetric Problems

Tikhonov Regularization of Large Symmetric Problems NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Singular Value Decomposition

Singular Value Decomposition Singular Value Decomposition Motivatation The diagonalization theorem play a part in many interesting applications. Unfortunately not all matrices can be factored as A = PDP However a factorization A =

More information

Applied Numerical Linear Algebra. Lecture 8

Applied Numerical Linear Algebra. Lecture 8 Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ

More information

Ill Posed Inverse Problems in Image Processing

Ill Posed Inverse Problems in Image Processing Ill Posed Inverse Problems in Image Processing Introduction, Structured matrices, Spectral filtering, Regularization, Noise revealing I. Hnětynková 1,M.Plešinger 2,Z.Strakoš 3 hnetynko@karlin.mff.cuni.cz,

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T.

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T. Notes on singular value decomposition for Math 54 Recall that if A is a symmetric n n matrix, then A has real eigenvalues λ 1,, λ n (possibly repeated), and R n has an orthonormal basis v 1,, v n, where

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 19: More on Arnoldi Iteration; Lanczos Iteration Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 17 Outline 1

More information

Singular Value Decomposition

Singular Value Decomposition Singular Value Decomposition (Com S 477/577 Notes Yan-Bin Jia Sep, 7 Introduction Now comes a highlight of linear algebra. Any real m n matrix can be factored as A = UΣV T where U is an m m orthogonal

More information

Lecture 9 Least Square Problems

Lecture 9 Least Square Problems March 26, 2018 Lecture 9 Least Square Problems Consider the least square problem Ax b (β 1,,β n ) T, where A is an n m matrix The situation where b R(A) is of particular interest: often there is a vectors

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity

More information

The Singular Value Decomposition

The Singular Value Decomposition The Singular Value Decomposition We are interested in more than just sym+def matrices. But the eigenvalue decompositions discussed in the last section of notes will play a major role in solving general

More information

1 Singular Value Decomposition and Principal Component

1 Singular Value Decomposition and Principal Component Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Linear Algebra. Session 12

Linear Algebra. Session 12 Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)

More information

The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation

The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation Rosemary Renaut Collaborators: Jodi Mead and Iveta Hnetynkova DEPARTMENT OF MATHEMATICS

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

Total Least Squares Approach in Regression Methods

Total Least Squares Approach in Regression Methods WDS'08 Proceedings of Contributed Papers, Part I, 88 93, 2008. ISBN 978-80-7378-065-4 MATFYZPRESS Total Least Squares Approach in Regression Methods M. Pešta Charles University, Faculty of Mathematics

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

SVD, PCA & Preprocessing

SVD, PCA & Preprocessing Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees

More information

Numerical Methods for Separable Nonlinear Inverse Problems with Constraint and Low Rank

Numerical Methods for Separable Nonlinear Inverse Problems with Constraint and Low Rank Numerical Methods for Separable Nonlinear Inverse Problems with Constraint and Low Rank Taewon Cho Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial

More information

Krylov Subspaces. Lab 1. The Arnoldi Iteration

Krylov Subspaces. Lab 1. The Arnoldi Iteration Lab 1 Krylov Subspaces Lab Objective: Discuss simple Krylov Subspace Methods for finding eigenvalues and show some interesting applications. One of the biggest difficulties in computational linear algebra

More information

Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) School of Computing National University of Singapore CS CS524 Theoretical Foundations of Multimedia More Linear Algebra Singular Value Decomposition (SVD) The highpoint of linear algebra Gilbert Strang

More information

What is Image Deblurring?

What is Image Deblurring? What is Image Deblurring? When we use a camera, we want the recorded image to be a faithful representation of the scene that we see but every image is more or less blurry, depending on the circumstances.

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Univerzita Karlova v Praze Matematicko-fyzikální fakulta. Marie Michenková. Katedra numerické matematiky

Univerzita Karlova v Praze Matematicko-fyzikální fakulta. Marie Michenková. Katedra numerické matematiky Univerzita Karlova v Praze Matematico-fyziální faulta DIPLOMOVÁ PRÁCE Marie Michenová Regularizační metody založené na metodách nejmenších čtverců Katedra numericé matematiy Vedoucí diplomové práce: Studijní

More information

Krylov subspace projection methods

Krylov subspace projection methods I.1.(a) Krylov subspace projection methods Orthogonal projection technique : framework Let A be an n n complex matrix and K be an m-dimensional subspace of C n. An orthogonal projection technique seeks

More information

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee J. Korean Math. Soc. 0 (0), No. 0, pp. 1 0 https://doi.org/10.4134/jkms.j160152 pissn: 0304-9914 / eissn: 2234-3008 DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM

More information

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 6 1 / 22 Overview

More information

Comparison of A-Posteriori Parameter Choice Rules for Linear Discrete Ill-Posed Problems

Comparison of A-Posteriori Parameter Choice Rules for Linear Discrete Ill-Posed Problems Comparison of A-Posteriori Parameter Choice Rules for Linear Discrete Ill-Posed Problems Alessandro Buccini a, Yonggi Park a, Lothar Reichel a a Department of Mathematical Sciences, Kent State University,

More information

Lecture 6 Positive Definite Matrices

Lecture 6 Positive Definite Matrices Linear Algebra Lecture 6 Positive Definite Matrices Prof. Chun-Hung Liu Dept. of Electrical and Computer Engineering National Chiao Tung University Spring 2017 2017/6/8 Lecture 6: Positive Definite Matrices

More information

Statistically-Based Regularization Parameter Estimation for Large Scale Problems

Statistically-Based Regularization Parameter Estimation for Large Scale Problems Statistically-Based Regularization Parameter Estimation for Large Scale Problems Rosemary Renaut Joint work with Jodi Mead and Iveta Hnetynkova March 1, 2010 National Science Foundation: Division of Computational

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

Singular value decomposition. If only the first p singular values are nonzero we write. U T o U p =0

Singular value decomposition. If only the first p singular values are nonzero we write. U T o U p =0 Singular value decomposition If only the first p singular values are nonzero we write G =[U p U o ] " Sp 0 0 0 # [V p V o ] T U p represents the first p columns of U U o represents the last N-p columns

More information

On the Vorobyev method of moments

On the Vorobyev method of moments On the Vorobyev method of moments Zdeněk Strakoš Charles University in Prague and Czech Academy of Sciences http://www.karlin.mff.cuni.cz/ strakos Conference in honor of Volker Mehrmann Berlin, May 2015

More information

On prescribing Ritz values and GMRES residual norms generated by Arnoldi processes

On prescribing Ritz values and GMRES residual norms generated by Arnoldi processes On prescribing Ritz values and GMRES residual norms generated by Arnoldi processes Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic joint work with Gérard

More information

Linear Least-Squares Data Fitting

Linear Least-Squares Data Fitting CHAPTER 6 Linear Least-Squares Data Fitting 61 Introduction Recall that in chapter 3 we were discussing linear systems of equations, written in shorthand in the form Ax = b In chapter 3, we just considered

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT -09 Computational and Sensitivity Aspects of Eigenvalue-Based Methods for the Large-Scale Trust-Region Subproblem Marielba Rojas, Bjørn H. Fotland, and Trond Steihaug

More information

The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation

The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation Rosemary Renaut DEPARTMENT OF MATHEMATICS AND STATISTICS Prague 2008 MATHEMATICS AND STATISTICS

More information

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) Chapter 5 The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 5.1 Basics of SVD 5.1.1 Review of Key Concepts We review some key definitions and results about matrices that will

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Geometric Modeling Summer Semester 2010 Mathematical Tools (1)

Geometric Modeling Summer Semester 2010 Mathematical Tools (1) Geometric Modeling Summer Semester 2010 Mathematical Tools (1) Recap: Linear Algebra Today... Topics: Mathematical Background Linear algebra Analysis & differential geometry Numerical techniques Geometric

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra The two principal problems in linear algebra are: Linear system Given an n n matrix A and an n-vector b, determine x IR n such that A x = b Eigenvalue problem Given an n n matrix

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Linear Algebra & Geometry why is linear algebra useful in computer vision?

Linear Algebra & Geometry why is linear algebra useful in computer vision? Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia

More information