THE χ²-CURVE METHOD OF PARAMETER ESTIMATION FOR GENERALIZED TIKHONOV REGULARIZATION


THE χ²-CURVE METHOD OF PARAMETER ESTIMATION FOR GENERALIZED TIKHONOV REGULARIZATION

JODI MEAD (Boise State University, Department of Mathematics, Boise, ID; jmead@boisestate.edu; supported by NSF grant EPS) AND ROSEMARY A. RENAUT (Arizona State University, Department of Mathematics and Statistics, Tempe, AZ; renaut@asu.edu; supported by NSF grants DMS)

Abstract. We discuss algorithms for estimating least squares regularization parameters based on weighting of a priori information. Theoretical results of Mead (2007) are extended to generalized Tikhonov regularization. Numerical algorithms using Newton's method are designed and evaluated. Results are presented and contrasted with other standard techniques for parameter estimation, including the L-curve, generalized cross validation and unbiased predictive risk estimator approaches. The χ²-curve method of parameter estimation is seen to be robust and cost effective in comparison to other methods.

Key words. Least squares, regularization, maximum likelihood estimation

AMS subject classifications. 5A9, 5A9, 6F5, 6G8

1. Introduction. Consider the ill-posed system of equations Ax = b, A ∈ R^{m×n}, b ∈ R^m, x ∈ R^n, for which an approximate solution is to be obtained by solving the regularized weighted least squares problem

(1.1)   x̂ = argmin J(x) = argmin { ||Ax − b||²_{W_b} + ||x − x_0||²_{W_x} }.

Here the weighted norm is ||y||²_W = y^T W y, for a general vector y and weighting matrix W. If the matrices W_b and W_x are inverse covariance matrices for normally distributed data b and model parameters x, respectively, then x̂ is the maximum likelihood estimate. In [9] Mead examined this regularized weighted least squares problem when b and x are not necessarily normally distributed and introduced a method whereby, given W_b or W_x, the other can be estimated. The algorithm relies on the following result.

THEOREM 1.1 ([11, 13, 9]). Let J(x) = (b − Ax)^T W_b (b − Ax) + (x − x_0)^T W_x (x − x_0), where the elements in x − x_0 and the elements in b − Ax are each independent and identically distributed. If W_b and W_x are symmetric positive definite (SPD) weighting matrices on the data and model errors, respectively, then for large m the minimum value of J is a random variable which follows a χ² distribution with m degrees of freedom.

When the hypotheses of the theorem are satisfied, given particular choices for W_b and W_x, the difference J(x̂) − m is an estimate of confidence that W_b and W_x are correct. If the difference is small, we have no reason to reject these choices, while if it is large, less confidence can be placed on the choices of W_b and W_x, respectively. In particular, if W_b is known but J is too small relative to m, then most likely the error covariance matrix C_x = W_x^{-1} is overestimated [13]. This suggests that if one weighting matrix, W_x = C_x^{-1} or W_b = C_b^{-1}, is known, the other can be found by requiring that J, to within a specified (1 − α) confidence interval, is a χ² random variable with m degrees of freedom. Specifically, the appropriate weighting matrix should be found such that

(1.2)   m − √(2m) z_{α/2} < J(x̂) < m + √(2m) z_{α/2},   equivalently   m − √(2m) z_{α/2} < r^T (A C_x A^T + C_b)^{-1} r < m + √(2m) z_{α/2},

where z_{α/2} is the relevant z-value, r = b − A x_0, and

(1.3)   x̂ = x_0 + C_x A^T (A C_x A^T + C_b)^{-1} r.

Furthermore, given estimates of W_b and W_x satisfying (1.2) for small α, the posterior probability density for x, P(x), is a Gaussian probability density such that

(1.4)   P(x) = constant × exp(−(1/2)(x − x̂)^T W̃_x (x − x̂)),

where W̃_x is the inverse covariance matrix of the posterior probability density. In this case, it can be shown that

(1.5)   W̃_x = A^T W_b A + W_x,   C̃_x = (A^T W_b A + W_x)^{-1}

[13]. This identification of the posterior covariance is then useful in assigning uncertainty bars to the posterior values of x̂, particularly when W̃_x is diagonal. Although our main focus here is on finding W_x given a reliable estimate of W_b, we present some results in Section 5.1 in which we find W_b given W_x. The algorithm is also presented in the Appendix.

For C_x SPD, the Cholesky factorization C_x = L_x L_x^T exists, where L_x is lower triangular and invertible. Mead solved the problem of finding L_x through a nonlinear minimization.

ALGORITHM 1 ([9]). Given confidence interval parameter α, initial residual r = b − A x_0 and an estimate of the data covariance C_b, find L_x which solves the nonlinear optimization:
  Minimize ||L_x L_x^T||_F
  Subject to m − √(2m) z_{α/2} < r^T (A L_x L_x^T A^T + C_b)^{-1} r < m + √(2m) z_{α/2},
             A L_x L_x^T A^T + C_b well-conditioned.

The use of the singular value decomposition facilitates the development of a more practical characterization for obtaining an appropriate SPD weighting matrix.
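Before turning to the SVD-based characterization, a minimal numerical sketch of the test (1.2) may be helpful. The following is my own numpy/scipy illustration, not the authors' code: it draws repeated realizations of data and prior errors with known covariances, evaluates J(x̂) = r^T (A C_x A^T + C_b)^{-1} r, and checks how often the value falls in the (1 − α) interval m ± √(2m) z_{α/2}. The problem sizes, noise levels and confidence level are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, n = 100, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)

sigma_b, sigma_x, alpha = 0.1, 0.5, 0.05          # assumed noise levels and confidence level
C_b, C_x = sigma_b**2 * np.eye(m), sigma_x**2 * np.eye(n)
z = norm.ppf(1 - alpha / 2)
lo, hi = m - np.sqrt(2 * m) * z, m + np.sqrt(2 * m) * z

hits, trials = 0, 1000
for _ in range(trials):
    b = A @ x_true + sigma_b * rng.standard_normal(m)   # data error with covariance C_b
    x0 = x_true + sigma_x * rng.standard_normal(n)      # prior estimate with covariance C_x
    r = b - A @ x0
    J = r @ np.linalg.solve(A @ C_x @ A.T + C_b, r)     # J(x_hat), cf. (1.2)
    hits += lo < J < hi

print(f"fraction inside the interval: {hits / trials:.3f} (expect about {1 - alpha})")
```

Since r is normally distributed with covariance A C_x A^T + C_b in this construction, J follows a χ² distribution with m degrees of freedom, so roughly 95% of the trials should land in the interval.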

LEMMA 1.2. Let U Σ V^T be the singular value decomposition (SVD) of L_b^{-1} A [3], with singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_p, where p ≤ min{m, n} is the rank of the matrix A. Then with C_x = L_x L_x^T, (1.2) is replaced by

(1.6)   m − √(2m) z_{α/2} < s^T P^{-1} s < m + √(2m) z_{α/2},

(1.7)   P = Σ V^T L_x L_x^T V Σ^T + I_m,   s = U^T L_b^{-1} r,

and the posterior covariance matrix is given by

(1.8)   C̃_x = L_x (L_x^T V Σ^T Σ V^T L_x + I_n)^{-1} L_x^T.

Proof. This is immediate by observing

A L_x L_x^T A^T + L_b L_b^T = L_b (L_b^{-1} A L_x L_x^T A^T (L_b^{-1})^T + I_m) L_b^T = L_b (U Σ V^T L_x L_x^T V Σ^T U^T + U U^T) L_b^T

and

r^T (A L_x L_x^T A^T + L_b L_b^T)^{-1} r = (U^T L_b^{-1} r)^T (Σ V^T L_x L_x^T V Σ^T + I_m)^{-1} (U^T L_b^{-1} r) = s^T P^{-1} s.

The result for the covariance matrix follows similarly.

While it is now apparent that a solution of the eigenvalue equation

(1.9)   P s = γ s,   γ = ||s||²/m,

is a solution of (1.2), the result does not necessarily hold in both directions. When C_x depends only on the single variable σ_x², from Lemma 1.2 there are insufficient degrees of freedom to solve (1.9) such that (||s||²/m, s/||s||) is an eigenpair for P. But for more general C_x, (1.2) is a severely underdetermined nonlinear equation, and may only be solved by imposing some additional constraints. In this case, it may be reasonable to find a solution via the eigenvalue equation (1.9). Suppose, for example, that C_x is diagonal and n < m; then (1.9) defines an overdetermined system of equations which can be solved by the method of nonlinear least squares. This will be a topic of future research.

Our focus here is on the development of a new practical algorithm for obtaining a single regularization parameter. In Section 2 we discuss the development of an algorithm based on the formulation discussed in [9]. Then in Section 3 we extend the result of Theorem 1.1. This result provides a more general algorithm, which is discussed in detail for the single variable case and contrasted with other methods for determining the single regularization parameter, including the L-curve, generalized cross validation (GCV) and unbiased predictive risk estimator (UPRE) [4, 5]. Numerical results are presented in Section 5. Future work and conclusions are discussed in Section 6. The algorithms are detailed in the Appendix.

2. Single Variable Case. For the special case C_x = σ_x² I_n, the matrix P in (1.7) simplifies and (1.6) becomes

m − √(2m) z_{α/2} < s^T diag(1/(σ_i² σ_x² + 1)) s < m + √(2m) z_{α/2},   (σ_i = 0, i = p+1, ..., m).

Hence Newton's method can be used to find the root σ_x of

(2.1)   F(σ_x) = s^T diag(1/(σ_i² σ_x² + 1)) s − m = 0.

With m̃ = m − Σ_{i=p+1}^m s_i² and

(2.2)   s̃_i = s_i/(σ_x² σ_i² + 1), i = 1, ..., p,   s̃_i = 0, i > p,

we immediately obtain

(2.3)   F(σ_x) = s^T s̃ − m̃   and
(2.4)   F'(σ_x) = −2 σ_x ||t||²,   t_i = σ_i s̃_i.

Newton's method is then

(2.5)   σ_x^{new} = σ_x + (s^T s̃ − m̃)/(2 σ_x ||t||²).

At convergence,

(2.6)   x̂ = x_0 + V t̃,   t̃ = σ_x² t,   and
(2.7)   C̃_x = σ_x² V diag(1/(σ_x² σ_i² + 1), I_{n−p}) V^T.

Clearly F(σ_x) is even, and monotonically decreasing for σ_x > 0. Moreover, lim_{σ_x→∞} F(σ_x) = −m̃, and hence any positive solution of F(σ_x) = 0 is unique. Additionally, if F(0) = ||s||² − m < 0, no solution exists.

ALGORITHM 2 (Find σ_x). Given tolerance ɛ, initial residual r = b − A x_0, initial σ_0, and an estimate of the SPD weighting matrix C_b (the data covariance), find L_x = σ_x I_n by Newton's method.

Initialization:
  Cholesky factorization C_b = L_b L_b^T.
  Form the SVD U Σ V^T = L_b^{-1} A.
  Solve L_b v = r and form s = U^T v.
  Calculate F(0) = ||s||² − m. If F(0) < 0, stop with error note.
  Calculate m̃ = m − Σ_{i=p+1}^m s_i².
  Set σ = σ_0 and σ_i = 0, i > p.
Do until converged (or maximum steps reached):
  s̃_i = s_i/(σ² σ_i² + 1), i = 1, ..., p;  F = s^T s̃ − m̃.
  Convergence test: |F| ≤ ɛ.
  t = diag(σ_i) s̃.
  σ = σ + F/(2 σ ||t||²).
End (Do)
Calculate t̃ = σ² t.
Solution of (1.1): x̂ = x_0 + V t̃.
Solution σ_x = σ. Covariance C̃_x = σ² V diag(1/(σ² σ_i² + 1), I_{n−p}) V^T.

Notice that at convergence (or termination of the algorithm) |F(σ_x)| = ɛ̃ for some ɛ̃ ≤ ɛ. Writing ɛ̃ = √(2m̃) z_{α/2}, the corresponding (1 − α) confidence interval can be calculated. The larger the value of ɛ̃, the larger the (1 − α) confidence interval and the greater the chance that any random choice of σ_x will allow J(x̂) to fall in the appropriate interval. Equivalently, if ɛ̃ is large, we have less confidence that σ_x is a good estimate of the standard deviation of the error in x_0.

2.1. Estimating the data error. If C_x is given and C_b is to be found, we have

(2.8)   m − √(2m) z_{α/2} < r^T (U Σ Σ^T U^T + L_b L_b^T)^{-1} r < m + √(2m) z_{α/2},

where now U Σ V^T is the SVD of A L_x. When L_b = σ_b I_m, let

(2.9)   F(σ_b) = s^T diag(1/(σ_i² + σ_b²)) s − m,   s = U^T r,

where the σ_i are the singular values of the matrix A L_x, with σ_i = 0, i > p. Then σ_b can be found by applying a similar Newton iteration to find the root of F = 0; see the Appendix.
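A compact numerical sketch of this Newton iteration (my own numpy illustration, not the authors' code; the rank tolerance, initial guess and stopping rule are assumptions) forms the SVD of L_b^{-1}A, evaluates F and F' as in (2.3)-(2.4), and iterates (2.5):

```python
import numpy as np

def chi2_newton_sigma_x(A, b, x0, C_b, sigma0=1.0, tol=1e-8, maxit=50):
    """Sketch of Algorithm 2: find sigma_x with C_x = sigma_x^2 I by Newton's method."""
    m = len(b)
    L_b = np.linalg.cholesky(C_b)                          # C_b = L_b L_b^T
    U, sv, Vt = np.linalg.svd(np.linalg.solve(L_b, A), full_matrices=True)
    r = b - A @ x0
    s = U.T @ np.linalg.solve(L_b, r)
    p = int(np.sum(sv > 1e-12 * sv[0]))                    # numerical rank (assumed tolerance)
    if s @ s - m < 0:
        raise ValueError("F(0) < 0: no positive root exists")
    m_tilde = m - np.sum(s[p:] ** 2)
    sigma, sp, sv_p = sigma0, s[:p], sv[:p]
    for _ in range(maxit):
        st = sp / (sigma**2 * sv_p**2 + 1.0)               # \tilde{s}, eq. (2.2)
        F = sp @ st - m_tilde                              # eq. (2.3)
        if abs(F) <= tol:
            break
        t = sv_p * st
        sigma = sigma + F / (2.0 * sigma * (t @ t))        # Newton step, eq. (2.5)
    st = sp / (sigma**2 * sv_p**2 + 1.0)
    x_hat = x0 + Vt.T[:, :p] @ (sigma**2 * sv_p * st)      # eq. (2.6)
    return sigma, x_hat
```

For the generalized Tikhonov case of Section 3 the same iteration applies with the singular values σ_i replaced by the generalized singular values γ_i and m̃ adjusted to m − n + p.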

3. Extension to Generalized Tikhonov Regularization. In many applications no information about the covariance structure of x is either assumed or included, and (1.1) is solved with ||x − x_0||²_{W_x} replaced by λ||D(x − x_0)||². D is typically chosen to yield approximations to the l-th order derivative, l = 0, 1, 2. Except in the l = 0 case, the matrix D ∈ R^{(n−l)×n} is necessarily not of full rank, but the invertibility condition

(3.1)   N(A) ∩ N(D) = {0}

is assumed, where N(A) is the null space of the matrix A. Consider, then, the generalization of (1.1) to include the regularization with weighting

(3.2)   x̂_GTik = argmin J_D(x) = argmin { ||Ax − b||²_{W_b} + ||D(x − x_0)||²_{W_D} },

which permits weighting of the error in D(x − x_0), where D ∈ R^{p×n}, p ≤ n, and in this case W_D ∈ R^{p×p}. This leads to the generalization of Theorem 1.1 for the functional J_D, for which the proof uses the generalized singular value decomposition (GSVD).

LEMMA 3.1 ([3]). Assume the invertibility condition (3.1) and m ≥ n ≥ p. There exist unitary matrices U ∈ R^{m×m}, V ∈ R^{p×p}, and a nonsingular matrix X ∈ R^{n×n} such that

(3.3)   A = U [Υ; 0_{(m−n)×n}] X^T,   D = V [M, 0_{p×(n−p)}] X^T,

where Υ = diag(υ_1, ..., υ_p, 1, ..., 1) ∈ R^{n×n} and M = diag(μ_1, ..., μ_p) ∈ R^{p×p}, and such that

(3.4)   0 ≤ υ_1 ≤ ... ≤ υ_p ≤ 1,   1 ≥ μ_1 ≥ ... ≥ μ_p > 0,   υ_i² + μ_i² = 1, i = 1, ..., p.

THEOREM 3.1. Under the same conditions as Theorem 1.1, along with the invertibility condition (3.1), for large m the minimum value of J_D is a random variable which follows a χ² distribution with m − n + p degrees of freedom.

Proof. The solution of (3.2) is given by the solution of the normal equations

(3.5)   x̂_GTik = x_0 + (A^T W_b A + D^T W_D D)^{-1} A^T W_b r
(3.6)           = x_0 + R(W_D) W_b^{1/2} r,   R(W_D) = (A^T W_b A + D^T W_D D)^{-1} A^T W_b^{1/2},   r = b − A x_0.

The minimum value of (3.2) is

J_D(x̂_GTik) = r^T (A R W_b^{1/2} − I)^T W_b (A R W_b^{1/2} − I) r + r^T (D R W_b^{1/2})^T W_D (D R W_b^{1/2}) r
(3.7)        = r^T W_b^{1/2} (I_m − W_b^{1/2} A R(W_D)) W_b^{1/2} r = r^T W_b^{1/2} (I_m − A(W_D)) W_b^{1/2} r,   A(W_D) = W_b^{1/2} A R(W_D),

where A(W_D) is the influence matrix [15]. Using the GSVD for the matrix pair [Ã, D̃], where Ã = W_b^{1/2} A and D̃ = W_D^{1/2} D, to simplify the matrix A(W_D), it can be seen that

I_m − A(W_D) = I_m − U Υ̂² U^T = U (I_m − Υ̂²) U^T,   Υ̂² = diag(υ_1², ..., υ_p², 1, ..., 1, 0, ..., 0) ∈ R^{m×m}.

Therefore

J_D(x̂_GTik) = r^T W_b^{1/2} U (I_m − Υ̂²) U^T W_b^{1/2} r = Σ_{i=1}^p μ_i² s_i² + Σ_{i=n+1}^m s_i² = ||k||²,   s = U^T W_b^{1/2} r.

Without loss of generality we assume that s is such that J_D > 0. Here k is a vector of length m − n + p > 0 whose components are normally distributed when the errors in the components of the data b are normally distributed. On the other hand, as in [9], if the errors are not normally distributed, then the central limit theorem applies to show that as m approaches infinity J_D is a χ² random variable with m − n + p degrees of freedom.

Clearly, Theorem 3.1 generalizes Theorem 1.1. When the matrix D is full rank, with the weighting matrix in Theorem 1.1 replaced by W_x = D^T W_D D, the two statements are equivalent, and the above proof is an alternative proof of Theorem 1.1. Observe, in particular, that for D a full rank matrix the covariance of Dx is given by C_D = D C_x D^T, W_D = (D^T)^{-1} W_x D^{-1}, and D^T W_D D = W_x, so that weighted regularization with a full rank matrix is indeed equivalent to the original case, and the weighting matrix W_D depends on the mapped model parameters and not on the actual parameters.

To obtain the optimal weighting matrix W_D for the generalized Tikhonov regularization, as in (1.2), W_D has to be found such that

m − n + p − √(2(m − n + p)) z_{α/2} < r^T W_b^{1/2} (I_m − A(W_D)) W_b^{1/2} r < m − n + p + √(2(m − n + p)) z_{α/2}.

Again, assuming that W_D has been found such that confidence in the hypotheses of the theorem is high, the posterior probability density for x is given by (1.4) with

(3.8)   W̃_D = A^T W_b A + D^T W_D D,   C̃_x = (A^T W_b A + D^T W_D D)^{-1}.

3.1. Single Variable Case. Let W_D = σ_x^{-2} I_p. Then

J_D(x̂_GTik) = r^T W_b^{1/2} (I_m − Ã C̃_x Ã^T) W_b^{1/2} r,   C̃_x = (A^T W_b A + σ_x^{-2} D^T D)^{-1}.

Using the GSVD of the pair [Ã, D], with γ_i = υ_i/μ_i the generalized singular values,

(3.9)   J_D(x̂_GTik) = r^T W_b^{1/2} U diag(I_p − Υ²(Υ² + σ_x^{-2} M²)^{-1}, 0_{n−p}, I_{m−n}) U^T W_b^{1/2} r
                    = Σ_{i=1}^p (1 − υ_i²/(υ_i² + σ_x^{-2} μ_i²)) s_i² + Σ_{i=n+1}^m s_i²
                    = Σ_{i=1}^p s_i²/(σ_x² γ_i² + 1) + Σ_{i=n+1}^m s_i²,   s = U^T W_b^{1/2} r.

As in Section 2, some algebra can be applied to reduce the computation of F and F'. Define m̃ = m − n + p − Σ_{i: γ_i = 0} s_i² − Σ_{i=n+1}^m s_i², let s̃ be the vector of length m with zero entries except s̃_i = s_i/(γ_i² σ_x² + 1), i = 1, ..., p, for γ_i ≠ 0, and let t_i = s̃_i γ_i. Then

(3.10)   F(σ_x) = s^T s̃ − m̃   and   F'(σ_x) = −2 σ_x ||t||².

The algorithm is the same as before, but with σ_i replaced everywhere by γ_i in the appropriate definitions of the variables. Additionally,

(3.11)   x̂_GTik = x_0 + (X^T)^{-1} t̃,   t̃_i = σ_x² t_i/μ_i, i = 1, ..., p,   t̃_i = s_i, i = p+1, ..., n,
(3.12)   C̃_x = (X^T)^{-1} diag(σ_x²/(σ_x² υ_i² + μ_i²), I_{n−p}) X^{-1}.

ALGORITHM 3 (Find σ_x). Given tolerance ɛ, initial σ_0, and an estimate of C_b, find L_x = σ_x I_p by Newton's method.

Initialization:
  Cholesky factorization C_b = L_b L_b^T.
  Form the GSVD (3.3) for the matrix pair [L_b^{-1} A, D].
  Solve L_b v = r and form s = U^T v.
  Calculate F(0) = Σ_{i=1}^p s_i² + Σ_{i=n+1}^m s_i² − (m − n + p). If F(0) < 0, stop with error note.
  Calculate m̃ = m − n + p − Σ_{i: γ_i = 0} s_i² − Σ_{i=n+1}^m s_i².
  Set σ = σ_0 and γ_i = 0, i > p.
Do until converged (or maximum steps reached):
  s̃_i = s_i/(γ_i² σ² + 1), i = 1, ..., p;  F = s^T s̃ − m̃.
  Convergence test: |F| ≤ ɛ.
  t = diag(γ_i) s̃.
  σ = σ + F/(2 σ ||t||²).
End (Do)
Solution of (3.2): set t̃_i = σ² t_i/μ_i, i = 1, ..., p, and t̃_i = s_i, i = p+1, ..., n; solve to give x̂_GTik = x_0 + (X^T)^{-1} t̃.
Solution σ_x = σ. Covariance C̃_x = (X^T)^{-1} diag(σ²/(σ² υ_i² + μ_i²), I_{n−p}) X^{-1} (calculate diagonal entries only).
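As a concrete illustration of how a parameter returned by Algorithm 3 is used, the following sketch (my own numpy illustration of the normal equations (3.5), not the authors' code; the first-difference operator and the single-parameter weighting W_D = σ_x^{-2} I are assumptions) computes x̂_GTik and the posterior covariance (3.8) directly for a given σ_x.

```python
import numpy as np

def gtik_solution(A, b, x0, W_b, D, sigma_x):
    """Generalized Tikhonov solution (3.5) with W_D = sigma_x^{-2} I (assumed single-parameter case)."""
    r = b - A @ x0
    M = A.T @ W_b @ A + (1.0 / sigma_x**2) * (D.T @ D)    # A^T W_b A + D^T W_D D
    x_hat = x0 + np.linalg.solve(M, A.T @ (W_b @ r))
    C_post = np.linalg.inv(M)                             # posterior covariance, cf. (3.8)
    return x_hat, C_post

# assumed example of a first-order derivative operator D of size (n-1) x n
n = 10
D = np.diff(np.eye(n), axis=0)                            # (D @ x)[i] = x[i+1] - x[i]
```

This direct solve is adequate for small dense problems; for ill-conditioned A or D the GSVD-based formulation above is the safer route, as discussed next.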

This algorithm can be modified, as for the SVD case, to handle the situation in which information on the weighting of the operator Dx is known and C_b is to be estimated; details are provided in the Appendix. Note also that when D is nonsingular and square, the GSVD of the matrix pair [Ã, D] corresponds to the SVD of the matrix ÃD^{-1}, with the singular values now ordered in the opposite direction. If either A or D is ill-conditioned, the calculation of ÃD^{-1} can be unstable, leading to contamination of the results, and hence the algorithm should in general be implemented using the GSVD.

4. Other Parameter Estimation Techniques: L-curve, GCV and UPRE. Algorithms 2 and 3 provide an alternative approach to techniques such as the L-curve, GCV and UPRE for estimating the single variable regularization parameter; see for example [6, 4, 5]. In terms of the GSVD, and using the notation of Section 3.1, the GCV function to be minimized is given by

(4.1)   C(σ) = ||Ãx(σ) − b̃||² / [trace(I_m − A(W_D))]²,   W_D = σ^{-2} I_p,   b̃ = W_b^{1/2} b,
             = [ Σ_{i=n+1}^m s_i² + Σ_{i=1}^p (δ_{γ_i 0} s_i² + s̃_i²) ] / [ m − n + Σ_{i=1}^p 1/(γ_i² σ² + 1) ]²

[4]. The UPRE seeks to minimize the expected value of the predictive risk by finding the minimum of the UPRE function

(4.2)   U(σ) = ||Ãx(σ) − b̃||² + 2 trace(A(W_D)) − m
             = Σ_{i=n+1}^m s_i² + Σ_{i=1}^p (δ_{γ_i 0} s_i² + s̃_i²) + 2 (n − Σ_{i=1}^p 1/(γ_i² σ² + 1)) − m,

where note that the variance of the model error is explicitly included within the weighted residual [15]. In contrast, the L-curve approach seeks to find the corner point of the plot, on a log-log scale, of ||Dx(σ)|| against ||Ãx(σ) − b̃||. The advantages and disadvantages of these approaches are well noted in the literature. The L-curve does not yield optimal results for the weighted case C_b ≠ I, while the GCV and UPRE functions may be nearly flat for the optimal choice of σ, and/or may have multiple minima, which leads to difficulty in finding the optimal argument [4, 6].
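For comparison, the following sketch evaluates the GCV and UPRE functions on a grid of σ values. It is an assumed direct-matrix implementation for small dense problems (my own illustration, not the GSVD-based formulas above or the Regularization Tools routines); the white-noise weighting and the grid are assumptions.

```python
import numpy as np

def gcv_upre_curves(A, b, x0, D, sigma_b, sigmas):
    """Evaluate GCV (4.1) and UPRE (4.2) over a grid of sigma values (small, dense problems only)."""
    m = len(b)
    W_b = np.eye(m) / sigma_b**2                          # assumed white-noise weighting
    r = b - A @ x0
    gcv, upre = [], []
    for sig in sigmas:
        M = A.T @ W_b @ A + (1.0 / sig**2) * (D.T @ D)
        x = x0 + np.linalg.solve(M, A.T @ (W_b @ r))
        res2 = (A @ x - b) @ W_b @ (A @ x - b)            # ||A x - b||^2_{W_b}
        tr = np.trace(np.linalg.solve(M, A.T @ W_b @ A))  # trace of the influence matrix A(W_D)
        gcv.append(res2 / (m - tr) ** 2)
        upre.append(res2 + 2.0 * tr - m)
    return np.array(gcv), np.array(upre)

# typical usage: sigmas = np.logspace(-3, 3, 100); take the sigma minimizing each curve
```

The grid evaluation itself illustrates the cost issue discussed below: each of these curves must be sampled over many parameter values before a minimizer can be isolated.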

On the other hand, the discrepancy principle, which can be implemented by a Newton method [15] and does use the knowledge of the white noise in the data, finds σ_x such that the regularized residual satisfies

(4.3)   σ_b² = (1/m) ||b − A x(σ)||².

Consistent with our notation this becomes the requirement that

(4.4)   Σ_{i=1}^p (1/(γ_i² σ² + 1))² s_i² + Σ_{i=n+1}^m s_i² = m,

which is similar to (3.9), but note that the weight in the first sum is squared in this case.

In each of these cases, the algorithms rely on the calculation of the GSVD (resp. SVD) for evaluating the relevant functions (4.1), (4.2), or the L-curve. Yet, for the GCV and UPRE functions, the minimization cannot be performed efficiently until an interval containing the minimum has been identified. Thus, as for calculating the L-curve, the methods typically proceed by calculating each function for a range of parameter values on a logarithmic scale and then, after isolating a potential region for a minimum, finding the minimum within that range of parameter values. Effectively, in each case the function is evaluated for a substantial number of choices of the parameter value [4], in contrast to the Newton method described in Sections 2 and 3.1, which should converge with very few function evaluations. Hence, while the cost of the proposed algorithm is dominated by that of obtaining the GSVD (resp. SVD), this cost is the same as that for initializing the other standard methods for parameter estimation, while in contrast the optimal parameter is found with minimal additional cost.

5. Experiments. We present a series of representative results using the benchmark cases phillips and blur from the Regularization Tools toolbox [4] in order to contrast the χ² Newton method with well-known but illustrative problems. In addition, we present results for a real model from hydrology. These experiments contrast with the results presented in [9], in which Algorithm 1 was used with more general choices for the weighting matrix W_x. For the benchmark problems, we add noise to the right hand side b using the Matlab [7] function randn (resp. exprnd) to generate normally distributed (resp. exponentially distributed) noise in b, with standard deviation σ_{b_i} = a|b_i| + c b_max, where a and c are tolerances that determine the size of the variance in b, and b_max = max_i |b_i|. The covariance matrix C_b can therefore be given by diag(σ_{b_i}²), or, to simulate the case with white noise, σ_b² is taken as the average of the σ_{b_i}² and C_b = σ_b² I_m. These choices are denoted by the variable cb = 1, 2, 3 for the case C_b = I, the case C_b = σ_b² I, and the diagonal case, respectively. The fixed values of a and c used in the experiments yield errors in b illustrated for one random case in Figure 5.1. For those cases where a reference solution x_0 is used, it is generated by taking the exact known solution and adding noise in the same way as for modifying b, so that each component is assumed to have a different noise variance.
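A sketch of this noise model (a numpy illustration of the construction just described, not the authors' Matlab scripts; the values of a and c and the zero-mean centering convention for the exponential noise are assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(b, a=0.1, c=0.1, kind="normal"):
    """Componentwise noise with std sigma_i = a*|b_i| + c*max|b_i| (a, c assumed here)."""
    sigma = a * np.abs(b) + c * np.max(np.abs(b))
    if kind == "normal":
        e = sigma * rng.standard_normal(len(b))
    else:  # exponential noise, shifted to zero mean (one possible convention)
        e = rng.exponential(scale=sigma) - sigma
    return b + e, sigma

def data_covariance(sigma, cb):
    """The three data covariance choices used in the experiments (cb = 1, 2, 3)."""
    m = len(sigma)
    if cb == 1:
        return np.eye(m)                        # C_b = I: no statistical information
    if cb == 2:
        return np.mean(sigma**2) * np.eye(m)    # white-noise approximation C_b = sigma_b^2 I
    return np.diag(sigma**2)                    # componentwise diagonal C_b
```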

We also illustrate results for different choices of the generalized Tikhonov regularization, with D the zeroth, first and second order derivative operator, as indicated by l = 0, 1, 2.

[Figure 5.1] FIG. 5.1. Right hand side of problem phillips in (a) with normally distributed noise, for problem size n = 4. In (b) the exact solution and blurred noisy data for problem blur with n = 6.

First we show sample curves used by the L-curve, GCV, UPRE and χ²-curve methods to find the regularization parameter W_x. In Figure 5.2 the top row is for the Tikhonov method and the bottom row for generalized Tikhonov regularization with the first derivative operator. This is for the problem phillips, and C_b is chosen diagonal. The curves use a log-log scale, except for the χ²-curve, which uses a log scale for the x-axis only. One sees from these curves the characteristic corner of the L-curve, although it is not always so apparent, and the characteristics of the GCV and UPRE curves, both of which potentially include multiple minima and/or regions where the curves are flat. These properties make the determination of the desired parameter difficult. On the other hand, the monotonicity of the χ²-curve facilitates the Newton method. The χ²-curve generally produces a smaller regularization parameter than the GCV and L-curve algorithms, one which is mostly comparable to that of the UPRE algorithm.

Sample solutions obtained for phillips with all three choices of C_b, with each method for Tikhonov regularization, D = I, are illustrated in Figure 5.3, with error bars shown. Note that the reference solution x_0 = 0 is assumed. These solutions demonstrate the overestimation of the data variance when the confidence interval is wide, and the oversmoothing of the solution using the parameter obtained from the χ²-curve when no statistical information is provided. In Figure 5.4 sample solutions for each of the phillips cases using generalized Tikhonov regularization, with D the first order derivative operator, are illustrated. It can be observed that the actual solutions obtained from the χ²-curve are comparable to those with the UPRE and GCV algorithms; see Figure 5.4(f-h) and (j-l). For the problem blur, for which reasonable solutions are only obtained with larger n, we illustrate the solution for C_b = σ_b² I and derivative order l = 2 in Figure 5.5.

[Figure 5.2] FIG. 5.2. Curves used for phillips: (a), (b), (c) and (d) are the L, GCV, UPRE and χ² curves, respectively, for the Tikhonov problem of size n = 4 and diagonal C_b. The bottom group is for the generalized Tikhonov problem. The optimal parameter in each case is given in the caption of each graph. The method uses x_0 = 0.

These solutions show that the statistical information is crucial in obtaining some reasonable estimate of deblurring; the L-curve results are far too noisy. Indeed, the solutions given are for two different noise levels, and it can be seen that while the L-curve is sensitive to the change in noise level, the other algorithms perform comparably to each other. We can see from these sample solutions that the new method is as effective as both UPRE and GCV when statistical information on the data is provided, and from the error bar plots it is also clear that the L-curve results, which include no statistical information, yield solutions with much less certainty in terms of the posterior covariance matrices. Indeed, the results presented here support the premise, reported elsewhere, that the L-curve is not the method of choice when statistical information is available. The ability of the Newton-based χ²-method to yield solutions comparable to those of either UPRE or GCV suggests that it, and not GCV or UPRE, is the method of choice when statistical information is available. In particular, the Newton method converges quadratically, and in the examples considered convergence is achieved in a handful of steps on average (Tables 5.1 and 5.2). These results also point to the potential for lack of success of the χ² method when no statistical information is available: in particular, the average values of the obtained standard deviation for problem blur are negative, corresponding to cases where there is no positive solution of F = 0.

[Figure 5.3] FIG. 5.3. Tikhonov solutions of phillips: (a), (b), (c) and (d) for problem size n = 4 and C_b = I. The middle group, (e)-(h), uses C_b = σ_b² I, and the lower group, (i)-(l), uses the diagonal C_b. The method uses x_0 = 0.

Tables 5.3 and 5.4 report the norm of the actual relative error and the weighted predictive risk, (1/m)||Ã(x_GTik − x)||², over repeated trials of phillips with n = 6, for the χ², L-curve, GCV and UPRE methods, respectively, with normally distributed and exponentially distributed noise. The first two columns in each case are the order of the derivative and the choice of the covariance matrix C_b. In these results a fixed reference value x_0, obtained from the exact x with componentwise noise added, is specified. The first block in each case gives the calculated means and standard deviations of the relative errors over all solutions, while the second block reports the same data, but with the mean and standard deviation calculated over all solutions with relative errors less than .9 and .5 for normally and exponentially distributed noise, respectively. This second block is presented in each case so as to illustrate the performance for acceptable solutions.

[Figure 5.4] FIG. 5.4. Generalized Tikhonov solutions for phillips: (a), (b), (c) and (d) for problem size n = 4 and C_b = I. The middle group, (e)-(h), uses C_b = σ_b² I, and the lower group, (i)-(l), uses the diagonal C_b. The method uses x_0 = 0.

For all methods there are occasions in which the apparently optimally determined parameter leads to totally oversmoothed or otherwise unrealistic solutions, which are immediately obvious as unrealistic by visual examination. Examples of the distribution of the relative errors, with the unrealistic solutions removed, are illustrated in Figure 5.6. These histograms are for the case using the first derivative operator and W_b = σ_b^{-2} I, and the caption triples give the mean, the standard deviation, and the number of solutions not counted in each case. They demonstrate that the statistically based methods (GCV, UPRE and χ²) generate very tight distributions for the acceptable solutions, and that all methods lead to some unrealistic solutions. The last block in the tables reports the risk over all solutions. Apparently, even totally unrealistic solutions minimize the predictive risk.

[Figure 5.5] FIG. 5.5. Solution of blur with reference solution x_0 = 0. Normally distributed noise is added at two different levels in (a) and (b). Problem size n = 6. Generalized Tikhonov regularization with the second derivative operator is used. The titles of each figure indicate the method used and the resulting norm of the error.

GCV and UPRE, which are designed to minimize the predictive risk, do indeed minimize the predictive risk, but not significantly better than either the L-curve or the χ²-curve methods. The χ² and UPRE methods give the best minimization of the actual relative error, but the average error over all acceptable solutions is very similar across all methods.

[Table 5.1] TABLE 5.1. Convergence characteristics for problem phillips with n = 4 over repeated runs: for each derivative order l and covariance choice cb, the mean and standard deviation of the number of Newton iterations k, of the converged σ, and of the final value of F.

[Table 5.2] TABLE 5.2. Convergence characteristics for problem blur with n = 36 over repeated runs, reported as in Table 5.1.

[Table 5.3] TABLE 5.3. Error characteristics for problem phillips with n = 6 over repeated runs for normally distributed noise, with the estimate x_0 obtained by adding componentwise normally distributed noise to the exact solution; the value of x_0 is fixed for all runs. For each derivative order l and covariance choice cb, means and standard deviations are given for the χ², L-curve, GCV and UPRE methods in three blocks: the relative errors over all solutions, the relative errors with all errors greater than .9 removed, and the weighted risk over all solutions.

[Table 5.4] TABLE 5.4. Error characteristics for problem phillips with n = 6 over repeated runs for exponentially distributed noise, with the estimate x_0 obtained by adding componentwise normally distributed noise to the exact solution; the value of x_0 is fixed for all runs. The three blocks are as in Table 5.3, with errors greater than .5 removed in the second block.

[Figure 5.6] FIG. 5.6. Histograms for the relative error in solving problem phillips over repeated runs, with unrealistic solutions removed. The first four plots are for normally distributed noise, and the last four for exponentially distributed noise, in the right hand side vector b. Generalized Tikhonov regularization with the first derivative operator is used, with weighting W_b = σ_b^{-2} I. Panel titles give, for each method (χ², L-curve, GCV, UPRE), the mean, the standard deviation, and the number of solutions not counted.

5.1. Estimating data error: Example from Hydrology. The χ²-curve method will be used to find weights σ_b on field measurements b, given initial parameter estimates x_0 and C_x based on laboratory measurements. Not only do we get parameter estimates which optimally combine field and laboratory data, but also an estimate of the field measurement error, i.e. σ_b.

In nature, the coupling of terrestrial and atmospheric systems happens through soil moisture. Water must pass through soil on its way to groundwater and streams, and soil moisture feeds back to the atmosphere via evapotranspiration. Consequently, methods to quantify the movement of water through unsaturated soils are essential at all hydrologic scales. Traditionally, soil moisture movement is simulated using the Richards equation [12] for unsaturated flow:

(5.1)   ∂θ/∂t = ∇ · (K ∇h) + ∂K/∂z.

Here θ is the volumetric moisture content, h the pressure head, and K(h) the hydraulic conductivity. Solution of the Richards equation requires reliable inputs for θ(h) and K(h). These relationships are collectively referred to as soil-moisture characteristic curves, and they are typically highly nonlinear functions. These curves are commonly parameterized using the van Genuchten [14] and Mualem [10] relationships:

(5.2)   θ(h) = θ_r + (θ_s − θ_r)(1 + |αh|^n)^{−m},   h < 0,
        K(h) = K_s θ_e^l (1 − (1 − θ_e^{1/m})^m)²,   θ_e = (θ(h) − θ_r)/(θ_s − θ_r),   h < 0.

These equations contain seven independent parameters: θ_r and θ_s are the residual and saturated water contents (cm³ cm⁻³), respectively, α (cm⁻¹), n and m (commonly set m = 1 − 1/n) are empirical fitting parameters (-), K_s is the saturated hydraulic conductivity (cm/sec), and l is the pore connectivity parameter (-). We focus on parameter estimates for θ(h) to be used in the Richards equation.

Field studies and laboratory measurements of soil moisture content and pressure head in the Dry Creek catchment near Boise, Idaho are used to obtain θ(h) estimates. This watershed is typical of small watersheds in the Idaho Batholith, and hydrologic studies have been conducted there since 1998 under grants from the NASA Land Surface Hydrology program and the USDA National Research Initiative [8]. Detailed hill slope and small-catchment studies have been ongoing in two locations at low and intermediate elevations.

[Figure 5.7] FIG. 5.7. Soil water retention curves, θ(h), observed in the field and in the laboratory. The laboratory curve gives initial parameter estimates, while the field measurements are the data.

Measurements used here to test the χ²-curve method are from the intermediate elevation small catchment. The north facing slope is currently instrumented with a weather station, two soil pits recording temperature and moisture content at four depths, and four additional soil pits recording moisture content and pressure head with TDR/tensiometer pairs. We will show soil water retention curves from one of these pits. Laboratory measurements were made on undisturbed samples (approximately 54 mm in diameter and 30 mm long) taken from the field soil pits at the same depths, but up to 5 cm away from the location of the in-situ measurements so as not to influence subsequent measurements. These samples were subjected to multi-step outflow tests in the laboratory [2]. The curves that fit the laboratory data do not necessarily reflect what is observed in the field; see Figure 5.7.

The goal is to obtain parameter estimates x which contain soil moisture and pressure head information from both laboratory and field measurements, within their uncertainty ranges. We rely on the laboratory measurements for good first estimates of the parameters x_0 = [θ_r, θ_s, α, n, m] and their standard deviations σ_{x_i}. It takes 2-3 weeks to obtain one set of laboratory measurements, but this procedure is done multiple times, from which we obtain standard deviation estimates and form C_x = diag(σ²_{θ_r}, σ²_{θ_s}, σ²_α, σ²_n, σ²_m). These standard deviations account for spatial variability and measurement technique or error. However, measurements on this core may not accurately reflect soils in the entire watershed region.

In order to include what is observed in the field, the initial parameter estimates from the laboratory, x_0, are updated with measurements collected in-situ. However, we also cannot entirely rely on the in-situ data, because it contains error due to incomplete saturation and measurement technique or error. This means we must also specify a data error weight C_b a priori. It is not possible to obtain repeated measurements of field data and so obtain uncertainty estimates as was done with the laboratory cores. We instead estimate C_b by using the χ²-curve method. This requires a linear model, and we use the following technique to simulate the nonlinear behavior of (5.2). Let

(5.3)   θ(h, x) ≈ θ(h, x_0) + (∂θ/∂x)|_{x=x_0} (x − x_0)
(5.4)           = A(h, x_0) x + q(h, x_0).

Then b_i = θ(h_i, x) − q(h_i, x_0) is of dimension m, while A has dimension m × 5. An optimal x̂ is found by the χ²-curve method, x_0 is updated with it, and (5.4) is iterated. The results are shown in Figure 5.8, where we plot the initial estimate x_0 and the field measurements with their standard deviations, along with the final curve found by iterating the χ²-curve method. The final estimate x̂ converged in three iterations of the linear model (5.4). The standard deviation for x_0 was specified a priori, while the standard deviation for the data was calculated with the χ²-curve method and is estimated to be 0.775. We note that the large standard deviation on x_0 observed in the laboratory resulted in optimal estimates x̂ which more closely resemble what was observed in the field. The value of the χ² variable in this example exactly matched the number of data.
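A minimal sketch of this linearize-and-iterate procedure (my own numpy illustration under stated assumptions, not the authors' code: the retention curve follows (5.2), the Jacobian is formed by finite differences, and the estimate_sigma_b argument is a hypothetical callable standing in for the χ²-curve step of Algorithm 4) is:

```python
import numpy as np

def theta(h, x):
    """van Genuchten retention curve (5.2); x = [theta_r, theta_s, alpha, n, m]."""
    theta_r, theta_s, alpha, n, m = x
    return theta_r + (theta_s - theta_r) * (1.0 + np.abs(alpha * h) ** n) ** (-m)

def jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian of theta with respect to the parameters (assumed approximation)."""
    J = np.zeros((len(h), len(x)))
    for j in range(len(x)):
        xp = np.array(x, dtype=float)
        xp[j] += eps
        J[:, j] = (theta(h, xp) - theta(h, x)) / eps
    return J

def chi2_retention_fit(h_obs, theta_obs, x0, C_x, estimate_sigma_b, n_iter=3):
    """Iterate the linearization (5.3)-(5.4), estimating sigma_b with the supplied chi^2-curve routine."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        A = jacobian(h_obs, x)                          # A(h, x0) in (5.4)
        q = theta(h_obs, x) - A @ x                     # q(h, x0) in (5.4)
        b = theta_obs - q
        sigma_b = estimate_sigma_b(A, b, x, C_x)        # hypothetical: Algorithm 4 applied to (A, b)
        C_b = sigma_b**2 * np.eye(len(b))
        # weighted update (1.3) with the estimated data covariance
        x = x + C_x @ A.T @ np.linalg.solve(A @ C_x @ A.T + C_b, b - A @ x)
    return x, sigma_b
```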

[Figure 5.8] FIG. 5.8. The curve resulting from the value of x̂ found by the χ²-curve method optimally combines field and laboratory measurements, within their standard deviation ranges.

6. Conclusions. A cost effective algorithm for finding the regularization parameter for the solution of regularized least squares problems has been presented, for the case when a priori information on either the model or the data error is available. The algorithm offers a significant advantage compared to other statistically based approaches because it determines the unique root, when that root exists, of a nonlinear, monotonically decreasing scalar function by Newton's method. Consistent with the quadratic convergence properties of Newton's method, the algorithm converges very quickly, typically in no more than ten steps. Compared with the UPRE or GCV approaches this offers a considerable cost advantage, and it avoids the problem of the multiple minima of the GCV and UPRE functions. In accord with the comparison of statistically-based methods discussed in [15], we conclude that a statistically-based method should be used whenever such information is available, and that the method of choice should actually be this new Newton χ²-curve method. These positive results are encouraging for the extension of the method, as suggested in [9], to more general covariance structures of the data error, which will be the subject of future research.

7. Acknowledgements. We thank Jim McNamara, Department of Geosciences, and Molly Gribb, Department of Civil Engineering, at Boise State University for providing the in-situ and laboratory measurements.

Appendix. As previously noted in Sections 2.1 and 3.1, the algorithms can be modified to handle the case that C_x is given and C_b is to be found. From (2.9),

F(σ) = s^T s̃ − m,   s̃_i = s_i/(σ_i² + σ²),   σ_i = 0, i > p,
F'(σ) = −2 σ ||s̃||²,
x̂ = x_0 + L_x V t,   t_i = σ_i s_i/(σ² + σ_i²),
and C̃_x = σ² L_x V diag(1/(σ_i² + σ²)) V^T L_x^T.
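ALGORITHM 4 below formalizes this procedure; as a rough numerical sketch of the same iteration (my own numpy illustration, with the initial guess, stopping rule and iteration cap as assumptions):

```python
import numpy as np

def chi2_newton_sigma_b(A, b, x0, C_x, sigma0=1.0, tol=1e-8, maxit=50):
    """Sketch of Algorithm 4: find sigma_b with C_b = sigma_b^2 I by Newton's method."""
    m = len(b)
    L_x = np.linalg.cholesky(C_x)
    U, sv, Vt = np.linalg.svd(A @ L_x, full_matrices=True)
    sig2 = np.zeros(m)
    sig2[:len(sv)] = sv**2                                 # sigma_i^2, zero for i > p
    s = U.T @ (b - A @ x0)
    sigma = sigma0
    for _ in range(maxit):
        st = s / (sigma**2 + sig2)                         # \tilde{s}
        F = s @ st - m
        if abs(F) <= tol:
            break
        sigma = sigma + F / (2.0 * sigma * (st @ st))      # Newton step, F' = -2 sigma ||s~||^2
    return sigma
```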

ALGORITHM 4 (Find σ_b). Given tolerance ɛ, initial residual r = b − A x_0, initial σ_0, and an estimate of C_x, find L_b = σ_b I_m by Newton's method.

Initialization:
  Cholesky factorization C_x = L_x L_x^T.
  Form the SVD U Σ V^T = A L_x.
  Form s = U^T r.
  Check F(0) > 0; if not, stop with error note.
  Set σ = σ_0 and σ_i = 0, i > p.
Do until converged (or maximum steps reached):
  s̃_i = s_i/(σ² + σ_i²), i = 1, ..., m;  F = s^T s̃ − m.
  Convergence test: |F| ≤ ɛ.
  σ = σ + F/(2 σ ||s̃||²).
End (Do)
Calculate t_i = σ_i s_i/(σ² + σ_i²).
Solution of (1.1): x̂ = x_0 + L_x V t.
Solution σ_b = σ. Covariance C̃_x = σ² L_x V diag(1/(σ_i² + σ²)) V^T L_x^T.

For the GSVD case the equivalent results are obtained with s = U^T r, γ_i = 0 for i > p, and m̃ = m − n + p:

F(σ) = s^T s̃ − m̃,   s̃_i = s_i/(σ² + γ_i²) for i = 1, ..., p and i = n+1, ..., m,   s̃_i = 0 otherwise,
F'(σ) = −2 σ ||s̃||²,
x̂ = x_0 + (X^T)^{-1} t,   t_i = γ_i s̃_i/μ_i, i = 1, ..., p,   t_i = s_i, i = p+1, ..., n,   t_i = 0 otherwise,
C̃_x = σ² (X^T)^{-1} diag(1/(υ_i² + σ² μ_i²), I_{n−p}) X^{-1}.

ALGORITHM 5 (Find σ_b). Given tolerance ɛ, initial σ_0, initial residual r = b − A x_0, and an estimate of C_x, find L_b = σ_b I_m by Newton's method.

Initialization:
  Cholesky factorization C_x = L_x L_x^T.
  Form the GSVD (3.3) for the matrix pair [A, L_x^{-1} D].
  Form s = U^T r and m̃ = m − n + p.
  Check F(0) > 0; if not, stop with error note.
  Set σ = σ_0 and γ_i = 0, i > p.
Do until converged (or maximum steps reached):
  s̃_i = s_i/(σ² + γ_i²), i = 1, ..., p and i = n+1, ..., m;  F = s^T s̃ − m̃.
  Convergence test: |F| ≤ ɛ.
  σ = σ + F/(2 σ ||s̃||²).
End (Do)
Solution of (3.2): t_i = γ_i s̃_i/μ_i, i = 1, ..., p, t_i = s_i, i = p+1, ..., n; solve to give x̂ = x_0 + (X^T)^{-1} t.
Solution σ_b = σ. Covariance C̃_x = σ² (X^T)^{-1} diag(1/(υ_i² + σ² μ_i²), I_{n−p}) X^{-1}.

REFERENCES

[1] Bennett, A., 2005, Inverse Modeling of the Ocean and Atmosphere, Cambridge University Press.
[2] Figueras, J., and M. M. Gribb, A new user-friendly automated multi-step outflow test apparatus, Vadose Zone Journal (in preparation).
[3] Golub, G. H., and Van Loan, C., 1996, Matrix Computations, Johns Hopkins Press, Baltimore, 3rd ed.
[4] Hansen, P. C., 1994, Regularization Tools: A Matlab Package for Analysis and Solution of Discrete Ill-posed Problems, Numerical Algorithms.
[5] Hansen, P. C., 1998, Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion, SIAM Monographs on Mathematical Modeling and Computation 4.
[6] Hansen, P. C., Nagy, J. G., and O'Leary, D. P., 2006, Deblurring Images: Matrices, Spectra and Filtering, SIAM, Fundamentals of Algorithms.
[7] MATLAB is a registered trademark of The MathWorks, Inc., MathWorks Web Site.
[8] McNamara, J. P., Chandler, D. G., Seyfried, M., and Achet, S., 2005, Soil moisture states, lateral flow, and streamflow generation in a semi-arid, snowmelt-driven catchment, Hydrological Processes, 19.
[9] Mead, J., 2007, A priori weighting for parameter estimation, J. Inv. Ill-posed Problems, to appear.
[10] Mualem, Y., 1976, A new model for predicting the hydraulic conductivity of unsaturated porous media, Water Resour. Res., 12.
[11] Rao, C. R., 1973, Linear Statistical Inference and its Applications, Wiley, New York.
[12] Richards, L. A., 1931, Capillary conduction of liquids in porous media, Physics, 1.
[13] Tarantola, A., 2005, Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM.
[14] van Genuchten, M. Th., 1980, A closed-form equation for predicting the hydraulic conductivity of unsaturated soils, Soil Sci. Soc. Am. J., 44.
[15] Vogel, C. R., 2002, Computational Methods for Inverse Problems, SIAM, Frontiers in Applied Mathematics.


More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Golub-Kahan iterative bidiagonalization and determining the noise level in the data

Golub-Kahan iterative bidiagonalization and determining the noise level in the data Golub-Kahan iterative bidiagonalization and determining the noise level in the data Iveta Hnětynková,, Martin Plešinger,, Zdeněk Strakoš, * Charles University, Prague ** Academy of Sciences of the Czech

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Linear Algebra, part 3. Going back to least squares. Mathematical Models, Analysis and Simulation = 0. a T 1 e. a T n e. Anna-Karin Tornberg

Linear Algebra, part 3. Going back to least squares. Mathematical Models, Analysis and Simulation = 0. a T 1 e. a T n e. Anna-Karin Tornberg Linear Algebra, part 3 Anna-Karin Tornberg Mathematical Models, Analysis and Simulation Fall semester, 2010 Going back to least squares (Sections 1.7 and 2.3 from Strang). We know from before: The vector

More information

MARN 5898 Regularized Linear Least Squares.

MARN 5898 Regularized Linear Least Squares. MARN 5898 Regularized Linear Least Squares. Dmitriy Leykekhman Spring 2010 Goals Understanding the regularization. D. Leykekhman - MARN 5898 Parameter estimation in marine sciences Linear Least Squares

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Maths for Signals and Systems Linear Algebra in Engineering

Maths for Signals and Systems Linear Algebra in Engineering Maths for Signals and Systems Linear Algebra in Engineering Lectures 13 15, Tuesday 8 th and Friday 11 th November 016 DR TANIA STATHAKI READER (ASSOCIATE PROFFESOR) IN SIGNAL PROCESSING IMPERIAL COLLEGE

More information

2. LINEAR ALGEBRA. 1. Definitions. 2. Linear least squares problem. 3. QR factorization. 4. Singular value decomposition (SVD) 5.

2. LINEAR ALGEBRA. 1. Definitions. 2. Linear least squares problem. 3. QR factorization. 4. Singular value decomposition (SVD) 5. 2. LINEAR ALGEBRA Outline 1. Definitions 2. Linear least squares problem 3. QR factorization 4. Singular value decomposition (SVD) 5. Pseudo-inverse 6. Eigenvalue decomposition (EVD) 1 Definitions Vector

More information

Regularization methods for large-scale, ill-posed, linear, discrete, inverse problems

Regularization methods for large-scale, ill-posed, linear, discrete, inverse problems Regularization methods for large-scale, ill-posed, linear, discrete, inverse problems Silvia Gazzola Dipartimento di Matematica - Università di Padova January 10, 2012 Seminario ex-studenti 2 Silvia Gazzola

More information

Linear Least-Squares Data Fitting

Linear Least-Squares Data Fitting CHAPTER 6 Linear Least-Squares Data Fitting 61 Introduction Recall that in chapter 3 we were discussing linear systems of equations, written in shorthand in the form Ax = b In chapter 3, we just considered

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 The Singular Value Decomposition (SVD) continued Linear Algebra Methods for Data Mining, Spring 2007, University

More information

Linear Systems. Carlo Tomasi

Linear Systems. Carlo Tomasi Linear Systems Carlo Tomasi Section 1 characterizes the existence and multiplicity of the solutions of a linear system in terms of the four fundamental spaces associated with the system s matrix and of

More information

Backward Error Estimation

Backward Error Estimation Backward Error Estimation S. Chandrasekaran E. Gomez Y. Karant K. E. Schubert Abstract Estimation of unknowns in the presence of noise and uncertainty is an active area of study, because no method handles

More information

Linear Methods in Data Mining

Linear Methods in Data Mining Why Methods? linear methods are well understood, simple and elegant; algorithms based on linear methods are widespread: data mining, computer vision, graphics, pattern recognition; excellent general software

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

The Darcy-Weisbach Jacobian and avoiding zero flow failures in the Global Gradient Algorithm for the water network equations

The Darcy-Weisbach Jacobian and avoiding zero flow failures in the Global Gradient Algorithm for the water network equations The Darcy-Weisbach Jacobian and avoiding zero flow failures in the Global Gradient Algorithm for the water network equations by Elhay, S. and A.R. Simpson World Environmental & Water Resources Congress

More information

Numerical Methods I Singular Value Decomposition

Numerical Methods I Singular Value Decomposition Numerical Methods I Singular Value Decomposition Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 October 9th, 2014 A. Donev (Courant Institute)

More information

Numerical Methods. Elena loli Piccolomini. Civil Engeneering. piccolom. Metodi Numerici M p. 1/??

Numerical Methods. Elena loli Piccolomini. Civil Engeneering.  piccolom. Metodi Numerici M p. 1/?? Metodi Numerici M p. 1/?? Numerical Methods Elena loli Piccolomini Civil Engeneering http://www.dm.unibo.it/ piccolom elena.loli@unibo.it Metodi Numerici M p. 2/?? Least Squares Data Fitting Measurement

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information

New Fast Kalman filter method

New Fast Kalman filter method New Fast Kalman filter method Hojat Ghorbanidehno, Hee Sun Lee 1. Introduction Data assimilation methods combine dynamical models of a system with typically noisy observations to obtain estimates of the

More information

2. Review of Linear Algebra

2. Review of Linear Algebra 2. Review of Linear Algebra ECE 83, Spring 217 In this course we will represent signals as vectors and operators (e.g., filters, transforms, etc) as matrices. This lecture reviews basic concepts from linear

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

COMP 558 lecture 18 Nov. 15, 2010

COMP 558 lecture 18 Nov. 15, 2010 Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to

More information

Linear Systems. Carlo Tomasi. June 12, r = rank(a) b range(a) n r solutions

Linear Systems. Carlo Tomasi. June 12, r = rank(a) b range(a) n r solutions Linear Systems Carlo Tomasi June, 08 Section characterizes the existence and multiplicity of the solutions of a linear system in terms of the four fundamental spaces associated with the system s matrix

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 1. Basic Linear Algebra Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki Example

More information

Least squares: introduction to the network adjustment

Least squares: introduction to the network adjustment Least squares: introduction to the network adjustment Experimental evidence and consequences Observations of the same quantity that have been performed at the highest possible accuracy provide different

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Discrete ill posed problems

Discrete ill posed problems Discrete ill posed problems Gérard MEURANT October, 2008 1 Introduction to ill posed problems 2 Tikhonov regularization 3 The L curve criterion 4 Generalized cross validation 5 Comparisons of methods Introduction

More information

Laboratorio di Problemi Inversi Esercitazione 2: filtraggio spettrale

Laboratorio di Problemi Inversi Esercitazione 2: filtraggio spettrale Laboratorio di Problemi Inversi Esercitazione 2: filtraggio spettrale Luca Calatroni Dipartimento di Matematica, Universitá degli studi di Genova Aprile 2016. Luca Calatroni (DIMA, Unige) Esercitazione

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Vector Spaces, Orthogonality, and Linear Least Squares

Vector Spaces, Orthogonality, and Linear Least Squares Week Vector Spaces, Orthogonality, and Linear Least Squares. Opening Remarks.. Visualizing Planes, Lines, and Solutions Consider the following system of linear equations from the opener for Week 9: χ χ

More information

CLASS NOTES Computational Methods for Engineering Applications I Spring 2015

CLASS NOTES Computational Methods for Engineering Applications I Spring 2015 CLASS NOTES Computational Methods for Engineering Applications I Spring 2015 Petros Koumoutsakos Gerardo Tauriello (Last update: July 27, 2015) IMPORTANT DISCLAIMERS 1. REFERENCES: Much of the material

More information

Algebra C Numerical Linear Algebra Sample Exam Problems

Algebra C Numerical Linear Algebra Sample Exam Problems Algebra C Numerical Linear Algebra Sample Exam Problems Notation. Denote by V a finite-dimensional Hilbert space with inner product (, ) and corresponding norm. The abbreviation SPD is used for symmetric

More information

Computational Methods. Eigenvalues and Singular Values

Computational Methods. Eigenvalues and Singular Values Computational Methods Eigenvalues and Singular Values Manfred Huber 2010 1 Eigenvalues and Singular Values Eigenvalues and singular values describe important aspects of transformations and of data relations

More information

Inverse Theory. COST WaVaCS Winterschool Venice, February Stefan Buehler Luleå University of Technology Kiruna

Inverse Theory. COST WaVaCS Winterschool Venice, February Stefan Buehler Luleå University of Technology Kiruna Inverse Theory COST WaVaCS Winterschool Venice, February 2011 Stefan Buehler Luleå University of Technology Kiruna Overview Inversion 1 The Inverse Problem 2 Simple Minded Approach (Matrix Inversion) 3

More information

Notes on Eigenvalues, Singular Values and QR

Notes on Eigenvalues, Singular Values and QR Notes on Eigenvalues, Singular Values and QR Michael Overton, Numerical Computing, Spring 2017 March 30, 2017 1 Eigenvalues Everyone who has studied linear algebra knows the definition: given a square

More information

Singular value analysis of Joint Inversion

Singular value analysis of Joint Inversion Singular value analysis of Joint Inversion Jodi Mead Department of Mathematics Diego Domenzain Department of Geosciences James Ford Clearwater Analytics John Bradford Department of Geophysics Colorado

More information

Comparison of Averaging Methods for Interface Conductivities in One-dimensional Unsaturated Flow in Layered Soils

Comparison of Averaging Methods for Interface Conductivities in One-dimensional Unsaturated Flow in Layered Soils Comparison of Averaging Methods for Interface Conductivities in One-dimensional Unsaturated Flow in Layered Soils Ruowen Liu, Bruno Welfert and Sandra Houston School of Mathematical & Statistical Sciences,

More information

Krylov Subspace Methods to Calculate PageRank

Krylov Subspace Methods to Calculate PageRank Krylov Subspace Methods to Calculate PageRank B. Vadala-Roth REU Final Presentation August 1st, 2013 How does Google Rank Web Pages? The Web The Graph (A) Ranks of Web pages v = v 1... Dominant Eigenvector

More information

Linear Least Squares. Using SVD Decomposition.

Linear Least Squares. Using SVD Decomposition. Linear Least Squares. Using SVD Decomposition. Dmitriy Leykekhman Spring 2011 Goals SVD-decomposition. Solving LLS with SVD-decomposition. D. Leykekhman Linear Least Squares 1 SVD Decomposition. For any

More information

arxiv: v1 [math.na] 1 Sep 2018

arxiv: v1 [math.na] 1 Sep 2018 On the perturbation of an L -orthogonal projection Xuefeng Xu arxiv:18090000v1 [mathna] 1 Sep 018 September 5 018 Abstract The L -orthogonal projection is an important mathematical tool in scientific computing

More information

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability

More information

Linear Algebra, part 3 QR and SVD

Linear Algebra, part 3 QR and SVD Linear Algebra, part 3 QR and SVD Anna-Karin Tornberg Mathematical Models, Analysis and Simulation Fall semester, 2012 Going back to least squares (Section 1.4 from Strang, now also see section 5.2). We

More information

MATH 350: Introduction to Computational Mathematics

MATH 350: Introduction to Computational Mathematics MATH 350: Introduction to Computational Mathematics Chapter V: Least Squares Problems Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Spring 2011 fasshauer@iit.edu MATH

More information

Homework 1. Yuan Yao. September 18, 2011

Homework 1. Yuan Yao. September 18, 2011 Homework 1 Yuan Yao September 18, 2011 1. Singular Value Decomposition: The goal of this exercise is to refresh your memory about the singular value decomposition and matrix norms. A good reference to

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee J. Korean Math. Soc. 0 (0), No. 0, pp. 1 0 https://doi.org/10.4134/jkms.j160152 pissn: 0304-9914 / eissn: 2234-3008 DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM

More information