Statistically-Based Regularization Parameter Estimation for Large Scale Problems
Rosemary Renaut
Joint work with Jodi Mead and Iveta Hnetynkova
March 1, 2010
National Science Foundation: Division of Computational Mathematics
Signal/Image Restoration: Integral Model of Signal Degradation

$b(t) = \int K(t, s)\, x(s)\, ds$

K(t, s) describes the blur of the signal. Convolutional model: the shift-invariant kernel $K(t, s) = K(t - s)$ is the point spread function (PSF). Typically sampling includes noise e(t), so the model is

$b(t) = \int K(t - s)\, x(s)\, ds + e(t)$

Discrete model: given discrete samples b, find samples x of x(s). Let A discretize K, assumed known; the model is

$b = Ax + e.$

Naïvely invert the system to find x!
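A minimal numerical sketch of this degradation model; the problem size, the PSF width, and the noise level are assumptions for illustration, not values from the talk. It builds a Gaussian-PSF blurring matrix A, blurs a test signal x, and adds white noise to form b = Ax + e.

```python
import numpy as np

# Sketch of the discrete degradation model b = A x + e with a Gaussian PSF.
# n, the PSF width, and the noise level are illustrative assumptions.
n = 128
t = np.linspace(0.0, 1.0, n)
x = (np.abs(t - 0.4) < 0.10).astype(float) + 0.5 * (np.abs(t - 0.7) < 0.05)

width = 0.02                                    # assumed PSF width
A = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * width**2))
A /= A.sum(axis=1, keepdims=True)               # rows sum to 1: blur preserves mass

rng = np.random.default_rng(0)
sigma_b = 0.01 * np.max(np.abs(A @ x))          # assumed 1% noise level
b = A @ x + sigma_b * rng.standard_normal(n)
```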
Example: 1-D Original and Blurred Noisy Signal

[Figures: original signal x; blurred and noisy signal b, Gaussian PSF.]
The Solution: Regularization Is Needed

[Figures: the naïve solution versus a regularized solution.]
Two-Dimensional Image Deblurring

How do we obtain the deblurred image?
Least Squares for Ax = b: A Quick Review

Consider discrete systems $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $x \in \mathbb{R}^n$, with $b = Ax + e$.
Classical approach, linear least squares (A of full rank):
$x_{LS} = \arg\min_x \|Ax - b\|_2^2$
Difficulty: $x_{LS}$ is sensitive to changes in the right-hand side b when A is ill-conditioned. For convolutional models the system is ill-posed.
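To see the difficulty concretely, here is a hedged continuation of the earlier sketch (it assumes A, b, x from that block): the naïve inverse solution is dominated by amplified noise when A is ill-conditioned.

```python
import numpy as np

# Continues the sketch above; assumes A, b, x are already defined.
x_naive = np.linalg.solve(A, b)        # naive inversion of b = A x + e
print("cond(A) =", np.linalg.cond(A))  # very large for a smooth PSF
print("relative error:", np.linalg.norm(x_naive - x) / np.linalg.norm(x))
```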
Introduce Regularization to Pick a Solution

Weighted fidelity with regularization: regularize via
$x_{RLS}(\lambda) = \arg\min_x \{\|b - Ax\|^2_{W_b} + \lambda^2 R(x)\}$,
where $W_b$ is a weighting matrix, $R(x)$ is a regularization term, and λ is a regularization parameter, which is unknown.
The solution $x_{RLS}(\lambda)$ depends on λ, on the regularization operator R, and on the weighting matrix $W_b$.
The Weighting Matrix: Some Assumptions for Multiple Data Measurements

Given multiple measurements of the data b: usually there is error in b; e is an m-vector of random measurement errors with mean 0 and positive definite covariance matrix $C_b = E(ee^T)$.
For uncorrelated measurements, $C_b$ is a diagonal matrix of the variances of the errors (colored noise).
For white noise, $C_b = \sigma^2 I$.
Weighting by $W_b = C_b^{-1}$ in the data-fit term makes the errors $\tilde e = W_b^{1/2} e$ theoretically uncorrelated.
Difficulty: $W_b$ may increase the ill-conditioning of A!
Formulation: Generalized Tikhonov Regularization With Weighting

Use $R(x) = \|D(x - x_0)\|^2$:
$\hat x = \arg\min J(x) = \arg\min\{\|Ax - b\|^2_{W_b} + \lambda^2 \|D(x - x_0)\|^2\}. \quad (1)$
D is a suitable operator, often a derivative approximation; assume $\mathcal{N}(A) \cap \mathcal{N}(D) = \{0\}$.
$x_0$ is a reference solution, often $x_0 = 0$.
Question: given D and $W_b$, how do we find λ?
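A minimal sketch of the generalized Tikhonov solution (1) via the normal equations; $W_b = I$, the forward-difference D, the value λ = 0.1, and $x_0 = 0$ are assumptions for illustration. It assumes A, b, n from the earlier sketch.

```python
import numpy as np

# Sketch of the generalized Tikhonov solution (1) via the normal equations:
#   x(lambda) = argmin ||A x - b||^2 + lambda^2 ||D (x - x0)||^2,
# with W_b = I and a forward-difference D (both assumptions).
def tikhonov_solve(A, b, lam, D, x0):
    lhs = A.T @ A + lam**2 * (D.T @ D)
    rhs = A.T @ b + lam**2 * (D.T @ D) @ x0
    return np.linalg.solve(lhs, rhs)

# Assumes A, b, n from the earlier sketch.
D = np.eye(n) - np.eye(n, k=1)           # simple derivative approximation
x_reg = tikhonov_solve(A, b, lam=0.1, D=D, x0=np.zeros(n))  # lam = 0.1 assumed
```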
Example: Solutions for Increasing λ, D = I

[Sequence of figures showing the regularized solution as λ increases.]
Choice of λ Is Crucial

Different algorithms yield different solutions:
Discrepancy Principle
L-Curve
Generalized Cross Validation (GCV)
Unbiased Predictive Risk (UPRE)
The Discrepancy Principle

Suppose the noise is white: $C_b = \sigma_b^2 I$. Find λ such that the regularized residual satisfies
$\sigma_b^2 = \frac{1}{m}\|b - Ax(\lambda)\|_2^2. \quad (2)$
This can be implemented by a Newton root-finding algorithm, but the discrepancy principle typically oversmooths.
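The following is a minimal numerical sketch of the discrepancy principle (2) with D = I; it assumes A, b, sigma_b from the earlier sketch, and SciPy's brentq bracketing root-finder stands in for the Newton iteration mentioned on the slide.

```python
import numpy as np
from scipy.optimize import brentq

# Sketch of the discrepancy principle (2) with D = I: find lam such that
# (1/m) ||b - A x(lam)||^2 = sigma_b^2. Assumes A, b, sigma_b as above.
def discrepancy_gap(lam, A, b, sigma_b):
    m, n = A.shape
    x_lam = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T @ b)
    return np.linalg.norm(b - A @ x_lam) ** 2 / m - sigma_b**2

lam_dp = brentq(discrepancy_gap, 1e-6, 1e2, args=(A, b, sigma_b))  # assumed bracket
```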
Some Standard Approaches I: L-Curve (Find the Corner)

Let $r(\lambda) = (I - A(\lambda))b$, with influence matrix $A(\lambda) = A(A^T W_b A + \lambda^2 D^T D)^{-1} A^T$.
Plot $(\log\|Dx\|, \log\|r(\lambda)\|)$ and trade off the two contributions.
Expensive: requires a range of λ; the GSVD makes the calculations efficient.
Not statistically based. There may be a clear corner to find, or no corner at all.
Generalized Cross-Validation (GCV)

Let $A(\lambda) = A(A^T W_b A + \lambda^2 D^T D)^{-1} A^T$; one can pick $W_b = I$. Minimize the GCV function
$\frac{\|b - Ax(\lambda)\|^2_{W_b}}{[\mathrm{trace}(I_m - A(\lambda))]^2}$,
which estimates the predictive risk.
Expensive: requires a range of λ; the GSVD makes the calculations efficient.
A minimum is required; there may be multiple minima, and the function is sometimes flat.
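A minimal sketch of evaluating the GCV function on a grid, with $W_b = I$ and D = I (assumptions); it assumes A, b from the earlier sketch, and the λ search range is also an assumption.

```python
import numpy as np

# Sketch of the GCV function with W_b = I and D = I (assumptions);
# assumes A, b from the earlier sketch.
def gcv(lam, A, b):
    m, n = A.shape
    A_lam = A @ np.linalg.inv(A.T @ A + lam**2 * np.eye(n)) @ A.T  # influence matrix
    resid = b - A_lam @ b
    return (resid @ resid) / np.trace(np.eye(m) - A_lam) ** 2

lams = np.logspace(-6, 1, 50)                # assumed search range for lambda
lam_gcv = lams[np.argmin([gcv(l, A, b) for l in lams])]
```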
Unbiased Predictive Risk Estimation (UPRE)

Minimize the expected value of the predictive risk by minimizing the UPRE function
$\|b - Ax(\lambda)\|^2_{W_b} + 2\,\mathrm{trace}(A(\lambda)) - m$.
Expensive: requires a range of λ; the GSVD makes the calculations efficient.
Needs a trace estimate, and a minimum is required.
An Alternative: Iterative Methods with Stopping Criteria

Iterate to find an approximate solution of $Ax \approx b$, using LSQR. Introduce stopping criteria based on detecting the noise in the solution, e.g. the residual periodogram (O'Leary and Rust); stop when the noise dominates (Hnetynkova et al.).
Hybrid methods combine iteration with regularization parameter estimation:
The cost of regularizing the reduced system is minimal.
Advantage: any regularization may be used for the subproblem (Nagy et al.).
But how can we be sure when to stop the LSQR iteration? And how are the statistics included in the subproblem?
Idea: include the statistics directly.
Background: Statistics of the Least Squares Problem

Theorem (Rao 1973: First Fundamental Theorem). Let r be the rank of A and let $b \sim N(Ax, \sigma_b^2 I)$ (errors in the measurements are normally distributed with mean 0 and covariance $\sigma_b^2 I$). Then
$J = \min_x \|Ax - b\|^2 \sim \sigma_b^2\, \chi^2(m - r)$.
J follows a χ² distribution with m - r degrees of freedom: basically the discrepancy principle.

Corollary (Weighted Least Squares). For $b \sim N(Ax, C_b)$ and $W_b = C_b^{-1}$,
$J = \min_x \|Ax - b\|^2_{W_b} \sim \chi^2(m - r)$.
Extension: Statistics of the Regularized Least Squares Problem

Theorem: χ² distribution of the regularized functional (Renaut/Mead 2008). (Note the weighting matrix on the regularization term.)
$\hat x = \arg\min J_D(x) = \arg\min\{\|Ax - b\|^2_{W_b} + \|x - x_0\|^2_{W_D}\}, \quad W_D = D^T W_x D. \quad (3)$
Assume:
$W_b$ and $W_x$ are symmetric positive definite.
The problem is uniquely solvable: $\mathcal{N}(A) \cap \mathcal{N}(D) = \{0\}$.
The Moore-Penrose generalized inverse of $W_D$ is $C_D$.
Statistics: the errors in the right-hand side satisfy $e \sim N(0, C_b)$, and $x_0$ is known, so that $x - x_0 = f \sim N(0, C_D)$; i.e., $x_0$ is the mean vector of the model parameters.
Then
$J_D(\hat x(W_D)) \sim \chi^2(m + p - n)$.
Significance of the χ² Result

For sufficiently large $\tilde m = m + p - n$: since $J_D \sim \chi^2(m + p - n)$,
$E(J_D(\hat x(W_D))) = m + p - n$, $\mathrm{Var}(J_D) = 2(m + p - n)$.
Moreover,
$\tilde m - \sqrt{2\tilde m}\, z_{\alpha/2} < J_D(\hat x(W_D)) < \tilde m + \sqrt{2\tilde m}\, z_{\alpha/2}, \quad (4)$
where $z_{\alpha/2}$ is the relevant z-value for a χ²-distribution with $\tilde m = m + p - n$ degrees of freedom.
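As a small illustration of the interval (4), the following sketch computes the bounds for a given number of degrees of freedom; the significance level α = 0.05 and the example size 512 are assumptions.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the interval (4): J_D should lie within
# m_tilde +/- sqrt(2 * m_tilde) * z_{alpha/2}. alpha = 0.05 is assumed.
def chi2_interval(m_tilde, alpha=0.05):
    z = norm.ppf(1.0 - alpha / 2.0)
    half_width = np.sqrt(2.0 * m_tilde) * z
    return m_tilde - half_width, m_tilde + half_width

lo, hi = chi2_interval(512)   # e.g. m_tilde = 512 (an assumed size)
```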
Key Aspects of the Proof I: The Functional J

Algebraic simplifications: rewrite the functional as a quadratic form. The regularized solution is given in terms of the resolution matrix $R(W_D)$:
$\hat x = x_0 + (A^T W_b A + D^T W_x D)^{-1} A^T W_b r, \quad r = b - Ax_0 \quad (5)$
$\phantom{\hat x} = x_0 + R(W_D) W_b^{1/2} r = x_0 + y(W_D), \quad (6)$
$R(W_D) = (A^T W_b A + D^T W_x D)^{-1} A^T W_b^{1/2}. \quad (7)$
The functional is given in terms of the influence matrix $A(W_D)$:
$A(W_D) = W_b^{1/2} A R(W_D) \quad (8)$
$J_D(\hat x) = r^T W_b^{1/2}(I_m - A(W_D)) W_b^{1/2} r; \text{ let } \tilde r = W_b^{1/2} r \quad (9)$
$\phantom{J_D(\hat x)} = \tilde r^T (I_m - A(W_D)) \tilde r, \text{ a quadratic form.} \quad (10)$
Key Aspects of the Proof II: Properties of a Quadratic Form

χ² distribution of quadratic forms $x^T P x$ for normal variables (Fisher-Cochran theorem):
Suppose the components $x_i$ are independent normal variables, $x_i \sim N(0, 1)$, $i = 1 : n$. A necessary and sufficient condition that $x^T P x$ has a central χ² distribution is that P is idempotent, $P^2 = P$; in that case the degrees of freedom of the χ² distribution are $\mathrm{rank}(P) = \mathrm{trace}(P)$.
When the means $\mu_i$ of the $x_i$ are nonzero, $x^T P x$ has a non-central χ² distribution with non-centrality parameter $c = \mu^T P \mu$.
A χ² random variable with n degrees of freedom and centrality parameter c has mean $n + c$ and variance $2(n + 2c)$.
Key Aspects of the Proof III: Requires the GSVD

Lemma. Assume invertibility and $m \geq n \geq p$. There exist unitary matrices $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{p \times p}$, and a nonsingular matrix $X \in \mathbb{R}^{n \times n}$ such that
$A = U \begin{bmatrix} \Upsilon \\ 0_{(m-n) \times n} \end{bmatrix} X^T, \quad D = V[M,\ 0_{p \times (n-p)}] X^T, \quad (11)$
where
$\Upsilon = \mathrm{diag}(\upsilon_1, \ldots, \upsilon_p, 1, \ldots, 1) \in \mathbb{R}^{n \times n}$, $M = \mathrm{diag}(\mu_1, \ldots, \mu_p) \in \mathbb{R}^{p \times p}$,
$0 \leq \upsilon_1 \leq \cdots \leq \upsilon_p \leq 1$, $1 \geq \mu_1 \geq \cdots \geq \mu_p > 0$, $\upsilon_i^2 + \mu_i^2 = 1$, $i = 1, \ldots, p. \quad (12)$

The functional with the GSVD: let $Q = \mathrm{diag}(\mu_1, \ldots, \mu_p, 0_{n-p}, I_{m-n})$; then
$J = \tilde r^T (I_m - A(W_D)) \tilde r = \|Q U^T \tilde r\|_2^2 = \|\tilde k\|_2^2$.
Proof IV: Statistical Distribution of the Weighted Residual

Covariance structure: the errors in b satisfy $e \sim N(0, C_b)$. Now b depends on x, $b = Ax + e$; hence we can show $b \sim N(Ax_0, C_b + AC_D A^T)$ ($x_0$ is the mean of x). The residual satisfies
$r = b - Ax_0 \sim N(0, C_b + AC_D A^T)$, so $\tilde r = W_b^{1/2} r \sim N(0, I + \tilde A C_D \tilde A^T)$, $\tilde A = W_b^{1/2} A$.
Use the GSVD:
$I + \tilde A C_D \tilde A^T = U Q^{-2} U^T$, $Q = \mathrm{diag}(\mu_1, \ldots, \mu_p, I_{n-p}, I_{m-n})$.
Now let $k = Q U^T \tilde r$; then $k \sim N(0, Q U^T (U Q^{-2} U^T) U Q) = N(0, I_m)$.
But $J = \|\tilde k\|_2^2$, where $\tilde k$ is the vector k excluding components $p + 1 : n$ (the components zeroed out by the functional's Q). Thus $J_D \sim \chi^2(m + p - n)$.
When the Mean of the Parameters Is Not Known, or x_0 = 0 Is Not the Mean

Corollary: non-central χ² distribution of the regularized functional. Recall
$\hat x = \arg\min J_D(x) = \arg\min\{\|Ax - b\|^2_{W_b} + \|x - x_0\|^2_{W_D}\}, \quad W_D = D^T W_x D$.
Assume all the assumptions as before, except that $\bar x \neq x_0$ is the mean vector of the model parameters. Let
$c = \|Q U^T W_b^{1/2} A(\bar x - x_0)\|_2^2$.
Then $J_D \sim \chi^2(m + p - n, c)$: the functional at the optimum follows a non-central χ² distribution.
A Further Result When A Is Not of Full Column Rank

The rank-deficient solution: suppose A is not of full column rank, with rank r. Then the filtered solution can be written in terms of the GSVD as
$x_{FILT}(\lambda) = \sum_{i=p-r+1}^{p} \frac{\gamma_i^2}{\upsilon_i(\gamma_i^2 + \lambda^2)}\, s_i x_i + \sum_{i=p+1}^{n} s_i x_i = \sum_{i=1}^{p} \frac{f_i}{\upsilon_i}\, s_i x_i + \sum_{i=p+1}^{n} s_i x_i$,
where $f_i = 0$ for $i = 1 : p - r$ and $f_i = \gamma_i^2/(\gamma_i^2 + \lambda^2)$ for $i = p - r + 1 : p$. This yields
$J(x_{FILT}(\lambda)) \sim \chi^2(m - n + r, c)$;
notice that the degrees of freedom are reduced.
General Result

The cost functional follows a χ² statistical distribution: with degrees of freedom $\tilde m$ and centrality parameter c,
$E(J_D) = \tilde m + c$, $\mathrm{Var}(J_D) = 2\tilde m + 4c$.
This suggests: try to find $W_D$ so that $E(J_D) = \tilde m + c$.
Mead: an expensive nonlinear algorithm, c = 0, general $W_D$.
Here: first find λ only, i.e. find $W_x = \lambda^2 I$.
What Do We Need to Apply the Theory?

Requirements: the covariance $C_b$ of the data parameters b (or of the model parameters x!), and a priori information: $x_0$, the mean $\bar x$.
But $\bar x$ (and hence $x_0$) is not known. If not known, use repeated data measurements: calculate $C_b$ and the mean $\bar b$, and hence estimate the centrality parameter. $E(b) = A E(x)$ implies $\bar b = A \bar x$; hence
$c = \|Q U^T W_b^{1/2} (\bar b - A x_0)\|_2^2$,
$E(J_D) = E(\|Q U^T W_b^{1/2} (b - A x_0)\|_2^2) = m + p - n + c$.
Given the GSVD, estimate the degrees of freedom $\tilde m$. Then we can use $E(J_D)$ to find λ.
Assume x_0 Is the Mean (Experimentalists Know Something About the Model Parameters)

Designing the algorithm, I. Recall: if $C_b$ and $C_x$ are good estimates of the covariance, then $J_D(\hat x) - (m + p - n)$ should be small. Thus, let $\tilde m = m + p - n$; we want
$\tilde m - \sqrt{2\tilde m}\, z_{\alpha/2} < J_D(\hat x(W_D)) < \tilde m + \sqrt{2\tilde m}\, z_{\alpha/2}$,
where $z_{\alpha/2}$ is the relevant z-value for a χ²-distribution with $\tilde m$ degrees of freedom.
Goal: find $W_x$ to make the interval (4) tight. Single-variable case: find λ such that $J_D(\hat x(\lambda)) \approx \tilde m$.
A Newton Line-Search Algorithm to Find λ = 1/σ (Basic Algebra)

Newton's method to solve $F(\sigma) = J_D(\sigma) - \tilde m = 0$. We use $\sigma = 1/\lambda$, and $y(\sigma^{(k)})$ is the current solution, for which $x(\sigma^{(k)}) = y(\sigma^{(k)}) + x_0$. Then
$\frac{\partial J_D}{\partial \sigma}(\sigma) = -\frac{2}{\sigma^3}\|Dy(\sigma)\|^2 < 0$,
hence we have the basic Newton iteration
$\sigma^{(k+1)} = \sigma^{(k)}\left(1 + \frac{1}{2}\left(\frac{\sigma^{(k)}}{\|Dy\|}\right)^2 \left(J_D(\sigma^{(k)}) - \tilde m\right)\right)$.
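A minimal sketch of this Newton iteration, assuming white noise ($W_b = \sigma_b^{-2} I$), $x_0 = 0$, and the A, b, D, sigma_b of the earlier sketches; a dense solve stands in for the GSVD or LSQR machinery the talk uses.

```python
import numpy as np

# Sketch of the Newton iteration for F(sigma) = J_D(sigma) - m_tilde = 0.
# Assumes white noise W_b = I / sigma_b**2 and x0 = 0 (assumptions), with
# A, b, D, sigma_b from the earlier sketches.
def jd_and_y(sigma, A, b, D, sigma_b):
    lam = 1.0 / sigma
    wb = 1.0 / sigma_b**2
    y = np.linalg.solve(wb * (A.T @ A) + lam**2 * (D.T @ D), wb * (A.T @ b))
    r = b - A @ y
    return wb * (r @ r) + lam**2 * np.linalg.norm(D @ y) ** 2, y

def newton_sigma(A, b, D, sigma_b, m_tilde, sigma=1.0, tol=1e-8, maxit=50):
    for _ in range(maxit):
        J, y = jd_and_y(sigma, A, b, D, sigma_b)
        step = 0.5 * (sigma / np.linalg.norm(D @ y)) ** 2 * (J - m_tilde)
        sigma_new = max(sigma * (1.0 + step), 1e-12)  # keep sigma positive
        if abs(sigma_new - sigma) <= tol * sigma:
            return sigma_new
        sigma = sigma_new
    return sigma

# For square A and invertible D, m_tilde = m + p - n = m.
sigma_opt = newton_sigma(A, b, D, sigma_b, m_tilde=A.shape[0])
```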
Algorithm Using the GSVD

Use the GSVD of $[W_b^{1/2} A;\ D]$. For $\gamma_i$ the generalized singular values and $s = U^T W_b^{1/2} r$:
$\tilde m = m - n + p$, $\tilde s_i = s_i/(\gamma_i^2 \sigma_x^2 + 1)$, $i = 1, \ldots, p$, and $t_i = \tilde s_i \gamma_i$.
Find the root of
$F(\sigma_x) = \sum_{i=1}^{p} \frac{s_i^2}{\gamma_i^2 \sigma_x^2 + 1} + \sum_{i=n+1}^{m} s_i^2 - \tilde m = 0$.
Equivalently, solve F = 0 where $F(\sigma_x) = \tilde s^T s - \tilde m$ and $F'(\sigma_x) = -2\sigma_x \|t\|_2^2$.
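A small sketch of $F(\sigma_x)$ and $F'(\sigma_x)$ in terms of the GSVD quantities; the arrays gamma and s are assumed given (NumPy/SciPy do not ship a GSVD, so they would have to be computed separately).

```python
import numpy as np

# Sketch of F(sigma_x) and F'(sigma_x) from the GSVD quantities:
# gamma = generalized singular values (length p), s = U^T W_b^{1/2} r
# (length m), n = number of columns; all assumed to be given.
def F_and_Fprime(sigma_x, gamma, s, n):
    p = gamma.size
    d = gamma**2 * sigma_x**2 + 1.0
    s_tilde = s[:p] / d
    t = s_tilde * gamma
    m_tilde = s.size - n + p
    F = np.sum(s[:p] ** 2 / d) + np.sum(s[n:] ** 2) - m_tilde
    return F, -2.0 * sigma_x * np.sum(t**2)
```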
Discussion of Convergence

F is monotonically decreasing: $F'(\sigma_x) = -2\sigma_x \|Dy\|_2^2$.
Hence the solution either exists and is unique for positive σ, or no solution exists: F(0) < 0, which indicates incorrect statistics in the model.
Theoretically, $\lim_{\sigma \to \infty} F > 0$ is also possible; this is equivalent to λ = 0, i.e. no regularization is needed.
Practical Details of the Algorithm: Find the Parameter

Step 1: Bracket the root by a logarithmic search on σ to handle the asymptotes; this yields sigmamax and sigmamin.
Step 2: Calculate the Newton step, with steepness controlled by told. Let $t = \|Dy\|/\sigma^{(k)}$, where y is the current update; then
$\mathrm{step} = \frac{1}{2}\left(\frac{1}{\max\{t, \mathrm{told}\}}\right)^2 \left(J_D(\sigma^{(k)}) - \tilde m\right)$.
Step 3: Introduce a line search $\alpha^{(k)}$ in the Newton update,
$\mathrm{sigmanew} = \sigma^{(k)}(1 + \alpha^{(k)}\,\mathrm{step})$,
with $\alpha^{(k)}$ chosen such that sigmanew lies within the bracket.
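A minimal sketch of the bracketing step (Step 1), for any callable F_of_sigma (e.g. built from the GSVD sketch above); the grid limits and resolution are assumptions.

```python
import numpy as np

# Sketch of Step 1: scan sigma on a logarithmic grid and return an
# interval where F changes sign. F_of_sigma is any callable returning
# F(sigma); the grid limits are assumptions.
def bracket_root(F_of_sigma, lo=1e-8, hi=1e8, num=81):
    grid = np.logspace(np.log10(lo), np.log10(hi), num)
    vals = np.array([F_of_sigma(s) for s in grid])
    sign_change = np.where(np.sign(vals[:-1]) != np.sign(vals[1:]))[0]
    if sign_change.size == 0:
        return None  # no root: F(0) < 0, or F stays positive (lambda -> 0)
    i = sign_change[0]
    return grid[i], grid[i + 1]  # (sigmamin, sigmamax)
```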
Practical Details of the Algorithm: Large-Scale Problems

Algorithm initialization:
Convert the generalized Tikhonov problem to standard form. (If L is not invertible, you just need to know how to form $Ax$ and $A^T x$, and the null space of L.)
Use the LSQR algorithm to find the bidiagonal matrix for the projected problem.
Obtain a solution of the bidiagonal problem for a given initial σ.
Subsequent steps:
Increase the dimension of the space if needed, reusing the existing bidiagonalization; a smaller system may also be used if appropriate.
Each σ update of the algorithm reuses saved information from the Lanczos bidiagonalization.
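A minimal sketch of the large-scale flavor using SciPy's LSQR with D = I (standard form assumed). SciPy exposes Tikhonov damping through the damp argument but does not expose the saved Lanczos bidiagonalization that the talk's algorithm reuses across σ updates, so this illustrates only one inner solve.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# One damped LSQR solve of min ||A x - b||^2 + (1/sigma)^2 ||x||^2,
# i.e. standard-form Tikhonov with lam = 1/sigma. The iteration limit
# and current sigma are assumptions; assumes A, b from the earlier sketch.
sigma = 10.0                                   # assumed current sigma iterate
sol = lsqr(A, b, damp=1.0 / sigma, iter_lim=15)
x_proj, istop, itn = sol[0], sol[1], sol[2]    # solution, stop flag, iterations
```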
Comparison with the Standard LSQR Hybrid Algorithm

Our algorithm can concurrently regularize and find λ. The standard hybrid LSQR solves the projected system and then adds regularization; this is also possible, and results of both approaches are shown.
Advantages and costs: the method needs only the cost of the standard LSQR algorithm, plus some updates to the solution solves for each iterated σ. The regularization introduced by the LSQR projection may be useful for preventing problems with the GSVD expansion. This makes the algorithm viable for large-scale problems.
Illustrating the Results for Problem Size 512: Two Standard Test Problems

Comparison for noise level 10%; on the left D = I, on the right D is the first derivative. Notice that the L-curve and χ²-LSQR perform well; UPRE does not.
Real Data: Seismic Signal Restoration

The data set and goal: a real data set of 48 signals of length 3000; the point spread function is derived from the signals. Calculate the signal variance pointwise over all 48 signals.
Goal: restore the signal x from Ax = b, where A is the PSF matrix and b is a given blurred signal.
Method of comparison (no exact solution is known): use convergence with respect to downsampling.
Comparison: High Resolution, White Noise

Greater contrast with the χ² method. UPRE is insufficiently regularized; the L-curve severely undersmooths (not shown). The parameters are not consistent across resolutions.
The UPRE Solution: x_0 = 0, White Noise

The regularization parameters are consistent: σ = 0.01005 at all resolutions.
The LSQR Hybrid Solution: White Noise

The regularization is quite consistent from resolution 2 to 100: σ = 0.0000029, 0.0000029, 0.0000029, 0.0000057, 0.0000057.
Illustrating the Deblurring Result: Problem Size 65536

Example taken from RESTORE TOOLS (Nagy et al., 2007-8), 15% noise. The computational cost is minimal: the projected problem size is 15, with λ = 0.58.
Problem grain, 15% Noise Added, for Increasing Subproblem Size

Signal-to-noise ratio $10\log_{10}(1/e)$ for relative error e; the regularization parameter σ is reported in each case.
Illustrating the Progress of the Newton Algorithm Post-LSQR

Illustrating the Progress of the Newton Algorithm with LSQR
Problem grain, 15% Noise Added, for Increasing Subproblem Size

Signal-to-noise ratio $10\log_{10}(1/e)$ for relative error e.
Conclusions

Observations:
A new statistical method for estimating the regularization parameter.
It compares favorably with UPRE in performance, and also with the L-curve (GCV is not competitive).
The method can be used for large-scale problems: the problem size of the hybrid LSQR used is very small.
The method is very efficient; the Newton iteration is robust and fast.
A priori information is needed, but it can be obtained directly from the data, e.g. using local statistical information of the image.
Future Work

Other results and future work:
A software package!
Diagonal weighting schemes.
Edge-preserving regularization: total variation.
Better handling of colored noise.
Bound constraints (with Mead; accepted).
THANK YOU!