Uncertainty Quantification for Inverse Problems November 7, 2011
Outline
- UQ and inverse problems
- Review: least-squares
- Review: Gaussian Bayesian linear model
- Parametric reductions for IP
- Bias, variance and MSE for Tikhonov
- Resampling methods
- Bayesian inversion
- Exploratory data analysis (by example)
Introduction
- Inverse problem: recover an unknown object from indirect, noisy observations.
- Example: deblurring of a signal $f$.
- Forward problem: $\mu(x) = \int_0^1 A(x - t)\, f(t)\, dt$
- Noisy data: $y_i = \mu(x_i) + \varepsilon_i$, $\quad i = 1, \ldots, n$
- Inverse problem: recover $f(t)$ for $t \in [0, 1]$.
General Model
- $A = (A_i)$, $A_i : H \to \mathbb{R}$, $f \in H$
- $y_i = A_i[f] + \varepsilon_i$, $\quad i = 1, \ldots, n$
- The recovery of $f$ from $A[f]$ is ill-posed.
- Errors $\varepsilon_i$ are modeled as random.
- The estimate $\hat f$ is an $H$-valued random variable.

Basic Question (UQ)
- How good is the estimate $\hat f$? What is the distribution of $\hat f$?
- Summarize characteristics of the distribution of $\hat f$.
- E.g., how likely is it that $\hat f$ will be far from $f$?
Review: least-squares
- $y = A\beta + \varepsilon$
- $A$ is $n \times m$ with $n > m$, and $A^t A$ has a stable inverse.
- $E(\varepsilon) = 0$, $\quad \mathrm{Var}(\varepsilon) = \sigma^2 I$
- Least-squares estimate of $\beta$: $\hat\beta = \arg\min_b \|y - Ab\|^2 = (A^t A)^{-1} A^t y$
- $\hat\beta$ is an unbiased estimator of $\beta$: $E_\beta(\hat\beta) = \beta$
- Covariance matrix of $\hat\beta$: $\mathrm{Var}_\beta(\hat\beta) = \sigma^2 (A^t A)^{-1}$
- If $\varepsilon$ is Gaussian $N(0, \sigma^2 I)$, then $\hat\beta$ is the MLE and $\hat\beta \sim N(\beta, \sigma^2 (A^t A)^{-1})$.
- Unbiased estimator of $\sigma^2$:
  $\hat\sigma^2 = \|y - A\hat\beta\|^2 / (n - m) = \|y - \hat y\|^2 / (n - \mathrm{dof}(\hat y))$,
  where $\hat y = A\hat\beta = Hy$ and $H = A (A^t A)^{-1} A^t$ is the hat matrix.
Confidence regions
- $1 - \alpha$ confidence interval for $\beta_i$:
  $I_i(\alpha): \; \hat\beta_i \pm t_{\alpha/2,\, n-m}\, \hat\sigma \sqrt{\big( (A^t A)^{-1} \big)_{i,i}}$
- C.I.'s are pre-data.
- Pointwise vs simultaneous $1 - \alpha$ coverage:
  pointwise: $P[\, \beta_i \in I_i(\alpha) \,] = 1 - \alpha$
  simultaneous: $P[\, \beta_i \in I_i(\alpha) \;\; \forall i \,] = 1 - \alpha$
- It is easy to find a $1 - \alpha$ joint confidence region $R_\alpha$ for $\beta$ that gives simultaneous coverage for all $i$.
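A minimal numpy/scipy sketch of the least-squares formulas and pointwise t intervals above; the design matrix, true coefficients and noise level are synthetic placeholders, not data from the talk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic example (placeholders): n observations, m coefficients
n, m, sigma = 50, 4, 0.1
A = rng.normal(size=(n, m))
beta_true = np.array([1.0, -0.5, 0.25, 2.0])
y = A @ beta_true + sigma * rng.normal(size=n)

# Least-squares estimate: beta_hat = (A^t A)^{-1} A^t y
AtA_inv = np.linalg.inv(A.T @ A)
beta_hat = AtA_inv @ A.T @ y

# Unbiased estimate of sigma^2 and covariance of beta_hat
y_hat = A @ beta_hat
sigma2_hat = np.sum((y - y_hat) ** 2) / (n - m)
cov_beta = sigma2_hat * AtA_inv

# Pointwise 1 - alpha confidence intervals for each beta_i
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - m)
half_width = t_crit * np.sqrt(np.diag(cov_beta))
ci = np.column_stack([beta_hat - half_width, beta_hat + half_width])
print(beta_hat, ci, sep="\n")
```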
Residuals
- $r = y - \hat y = (I - H)y$
- $E(r) = 0$, $\quad \mathrm{Var}(r) = \sigma^2 (I - H)$
- If $\varepsilon \sim N(0, \sigma^2 I)$, then $r \sim N(0, \sigma^2 (I - H))$;
  the residuals do not behave like the original errors $\varepsilon$.
- Corrected residuals: $\tilde r_i = r_i / \sqrt{1 - H_{ii}}$
- Corrected residuals may be used for model validation.
- Other types of residuals (e.g., recursive).
Review: Gaussian Bayesian linear model
- $y \mid \beta \sim N(A\beta, \sigma^2 I)$, $\quad \beta \sim N(\mu, \tau^2 \Sigma)$, $\quad$ $\mu$, $\sigma$, $\tau$, $\Sigma$ known
- Goal: update the prior distribution of $\beta$ in light of the data $y$.
- Compute the posterior distribution of $\beta$ given $y$: $\beta \mid y \sim N(\mu_y, \Sigma_y)$
  $\Sigma_y = \sigma^2 \big( A^t A + (\sigma^2/\tau^2)\, \Sigma^{-1} \big)^{-1}$
  $\mu_y = \big( A^t A + (\sigma^2/\tau^2)\, \Sigma^{-1} \big)^{-1} \big( A^t y + (\sigma^2/\tau^2)\, \Sigma^{-1} \mu \big)$
  $\quad\;\;$ = a weighted average of the least-squares estimate and the prior mean $\mu$
Summarizing the posterior:
- $\mu_y = E(\beta \mid y)$ = mean and mode of the posterior (MAP)
- $\Sigma_y = \mathrm{Var}(\beta \mid y)$ = covariance matrix of the posterior
- $1 - \alpha$ credible interval for $\beta_i$: $(\mu_y)_i \pm z_{\alpha/2} \sqrt{(\Sigma_y)_{i,i}}$
- Can also construct a $1 - \alpha$ joint credible region for $\beta$.
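A short numpy sketch of the posterior formulas above; the prior parameters (zero mean, unit scale, identity $\Sigma$) and the synthetic data are illustrative assumptions.

```python
import numpy as np

def gaussian_posterior(A, y, sigma, mu, tau, Sigma):
    """Posterior mean and covariance for y|beta ~ N(A beta, sigma^2 I),
    beta ~ N(mu, tau^2 Sigma)."""
    ratio = sigma**2 / tau**2
    Sigma_inv = np.linalg.inv(Sigma)
    M = np.linalg.inv(A.T @ A + ratio * Sigma_inv)
    mu_y = M @ (A.T @ y + ratio * Sigma_inv @ mu)
    Sigma_y = sigma**2 * M
    return mu_y, Sigma_y

# Example usage with placeholder prior: mu = 0, tau = 1, Sigma = I
rng = np.random.default_rng(1)
n, m = 50, 4
A = rng.normal(size=(n, m))
beta = np.array([1.0, -0.5, 0.25, 2.0])
y = A @ beta + 0.1 * rng.normal(size=n)
mu_y, Sigma_y = gaussian_posterior(A, y, sigma=0.1, mu=np.zeros(m), tau=1.0, Sigma=np.eye(m))

# 95% credible intervals: (mu_y)_i +/- z_{alpha/2} sqrt((Sigma_y)_ii)
z = 1.96
ci = np.column_stack([mu_y - z * np.sqrt(np.diag(Sigma_y)),
                      mu_y + z * np.sqrt(np.diag(Sigma_y))])
```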
Some properties
- $\mu_y$ minimizes $E\,\|\delta(y) - \beta\|^2$ among all estimators $\delta(y)$.
- $\mu_y \to \hat\beta_{ls}$ as $\sigma^2/\tau^2 \to 0$; $\quad \mu_y \to \mu$ as $\sigma^2/\tau^2 \to \infty$.
- $\mu_y$ can be used as a frequentist estimate of $\beta$, but:
  $\mu_y$ is biased, and its variance is NOT obtained by sampling from the posterior.
- Warning: a $1 - \alpha$ credible region MAY NOT have $1 - \alpha$ frequentist coverage.
- If $\mu$, $\sigma$, $\tau$ or $\Sigma$ are unknown, use hierarchical or empirical Bayesian methods and MCMC.
Parametric reductions for IP
- Transform the infinite-dimensional problem into a finite-dimensional one: i.e., $A[f] \to Aa$.
- Assumptions: $y = A[f] + \varepsilon = Aa + \delta + \varepsilon$
- Function representation: $f(x) = \phi(x)^t a + \eta(x)$, $\quad A[\eta] = 0$
- Penalized least-squares estimate:
  $\hat a = \arg\min_b \|y - Ab\|^2 + \lambda^2\, b^t S b = (A^t A + \lambda^2 S)^{-1} A^t y$
  $\hat f(x) = \phi(x)^t \hat a$
Example: $y_i = \int_0^1 A(x_i - t) f(t)\, dt + \varepsilon_i$
- Quadrature approximation: $\int_0^1 A(x_i - t) f(t)\, dt \approx \frac{1}{m} \sum_{j=1}^m A(x_i - t_j) f(t_j)$
- Discretized system: $y = Af + \delta + \varepsilon$, $\quad A_{ij} = A(x_i - t_j)$, $\quad f_j = f(t_j)$
- Regularized estimate:
  $\hat f = \arg\min_{g \in \mathbb{R}^m} \|y - Ag\|^2 + \lambda^2 \|Dg\|^2 = (A^t A + \lambda^2 D^t D)^{-1} A^t y \equiv Ly$
  $\hat f(x_i) = \hat f_i = e_i^t L y$
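A minimal numpy sketch of this discretized deconvolution with a Tikhonov (first-difference) penalty; the Gaussian blurring kernel, true signal, noise level and $\lambda$ are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200                                   # quadrature nodes t_j
t = (np.arange(m) + 0.5) / m
x = t.copy()                              # observation points x_i (same grid here)

kernel = lambda s: np.exp(-0.5 * (s / 0.03) ** 2)   # placeholder blurring kernel A(.)
A = kernel(x[:, None] - t[None, :]) / m             # (1/m) A(x_i - t_j)

f_true = np.sin(2 * np.pi * t) + (t > 0.5)          # placeholder true signal
sigma = 0.01
y = A @ f_true + sigma * rng.normal(size=m)

# First-difference penalty D and Tikhonov estimate f_hat = (A^t A + lam^2 D^t D)^{-1} A^t y
D = np.diff(np.eye(m), axis=0)
lam = 1e-3                                          # would normally be chosen by GCV, etc.
L = np.linalg.solve(A.T @ A + lam**2 * D.T @ D, A.T)
f_hat = L @ y
```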
Example: $A_i$ continuous on a Hilbert space $H$. Then:
- There are orthonormal $\phi_k \in H$, orthonormal $v_k \in \mathbb{R}^n$ and non-decreasing $\lambda_k \geq 0$ such that $A[\phi_k] = \lambda_k v_k$.
- $f = f_0 + f_1$, with $f_0 \in \mathrm{Null}(A)$, $f_1 \perp \mathrm{Null}(A)$.
- $f_0$ is not constrained by the data; $f_1 = \sum_{k=1}^n a_k \phi_k$.
- Discretized system: $y = Va + \varepsilon$, $\quad V = (\lambda_1 v_1 \cdots \lambda_n v_n)$
- Regularized estimate:
  $\hat a = \arg\min_{b \in \mathbb{R}^n} \|y - Vb\|^2 + \lambda^2 \|b\|^2 \equiv Ly$
  $\hat f(x) = \phi(x)^t L y$, $\quad \phi(x) = (\phi_i(x))$
Example: $A_i$ continuous on the RKHS $H = W^m([0,1])$. Then:
- $H = N_{m-1} \oplus H_m$, and there are $\phi_j \in H$ and $\kappa_j \in H$ such that
  $\|y - A[f]\|^2 + \lambda^2 \int_0^1 \big(f^{(m)}\big)^2 = \sum_i (y_i - \langle \kappa_i, f \rangle)^2 + \lambda^2 \|P_{H_m} f\|^2$
- $\hat f = \sum_j a_j \phi_j + \eta$, $\quad \eta \perp \mathrm{span}\{\phi_k\}$
- Discretized system: $y = Xa + \delta + \varepsilon$
- Regularized estimate:
  $\hat a = \arg\min_{b \in \mathbb{R}^n} \|y - Xb\|^2 + \lambda^2\, b^t P b \equiv Ly$
  $\hat f(x) = \phi(x)^t L y$, $\quad \phi(x) = (\phi_i(x))$
To summarize:
- $y = Aa + \delta + \varepsilon$
- $f(x) = \phi(x)^t a + \eta(x)$, $\quad A[\eta] = 0$
- $\hat a = \arg\min_{b \in \mathbb{R}^n} \|y - Ab\|^2 + \lambda^2\, b^t P b = (A^t A + \lambda^2 P)^{-1} A^t y$
- $\hat f(x) = \phi(x)^t \hat a$
Bias, variance and MSE of $\hat f$
- Is $\hat f(x)$ close to $f(x)$ on average?
- $\mathrm{Bias}(\hat f(x)) = E\hat f(x) - f(x) = \phi(x)^t B_\lambda\, a + \phi(x)^t G_\lambda A^t \delta - \eta(x)$
- Pointwise sampling variability: $\mathrm{Var}(\hat f(x)) = \sigma^2 \|A\, G_\lambda\, \phi(x)\|^2$
- Pointwise MSE: $\mathrm{MSE}(\hat f(x)) = \mathrm{Bias}(\hat f(x))^2 + \mathrm{Var}(\hat f(x))$
- Integrated MSE: $\mathrm{IMSE}(\hat f) = \int E\big(\hat f(x) - f(x)\big)^2\, dx = \int \mathrm{MSE}(\hat f(x))\, dx$
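A sketch of the pointwise bias/variance/MSE computation for the fully discretized Tikhonov estimate from the earlier deconvolution sketch; here $\phi(x)^t G_\lambda A^t$ reduces to the rows of $L$, and the terms $\delta$ and $\eta$ are ignored, which is an assumption made for illustration.

```python
import numpy as np

def tikhonov_pointwise_mse(A, D, f_true, sigma, lam):
    """Pointwise bias, variance and MSE of f_hat = (A^t A + lam^2 D^t D)^{-1} A^t y,
    ignoring the discretization terms delta and eta."""
    G = np.linalg.inv(A.T @ A + lam**2 * D.T @ D)      # G_lambda
    L = G @ A.T                                        # f_hat = L y
    bias = L @ (A @ f_true) - f_true                   # E f_hat - f
    var = sigma**2 * np.sum((A @ G) ** 2, axis=0)      # sigma^2 ||A G_lambda e_i||^2
    return bias, var, bias**2 + var

# Example (reusing A, D, f_true, sigma, lam from the deconvolution sketch):
# bias, var, mse = tikhonov_pointwise_mse(A, D, f_true, sigma, lam)
```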
What about the bias of f? Upper bounds are sometimes possible Geometric information can be obtained from bounds Optimization approach: find max/min of bias s.t. f S Average bias: choose a prior for f and compute mean bias over prior Example: f in RKHS H, f (x) = ρ x, f Bias( f (x) ) = A x ρ x, f Bias( f (x) ) A x ρ x f
Confidence intervals
- If $\varepsilon$ is Gaussian and $|\mathrm{Bias}(\hat f(x))| \leq B(x)$, then
  $\hat f(x) \pm \big( z_{\alpha/2}\, \sigma\, \|A G_\lambda \phi(x)\| + B(x) \big)$
  is a $1 - \alpha$ C.I. for $f(x)$.
- Simultaneous $1 - \alpha$ C.I.'s for $E(\hat f(x))$, $x \in S$: find $\beta$ such that
  $P\Big[\, \sup_{x \in S} |Z^t V(x)| \geq \beta \,\Big] \leq \alpha$,
  with $Z_i$ iid $N(0,1)$ and $V(x) = K G_\lambda \phi(x) / \|K G_\lambda \phi(x)\|$,
  and use results from the theory of maxima of Gaussian processes.
Residuals
- $r = y - A\hat a = y - \hat y$
- Moments: $E(r) = -A\, \mathrm{Bias}(\hat a) + \delta$, $\quad \mathrm{Var}(r) = \sigma^2 (I - H_\lambda)^2$
- Note: $\mathrm{Bias}(A\hat a) = A\, \mathrm{Bias}(\hat a)$ may be small even if $\mathrm{Bias}(\hat a)$ is significant.
- Corrected residuals: $\tilde r_i = r_i / \big(1 - (H_\lambda)_{ii}\big)$
Estimating $\sigma$ and $\lambda$
- $\lambda$ can be chosen using the same data $y$ (with GCV, the L-curve, the discrepancy principle, etc.) or independent training data.
- This selection introduces an additional error.
- If $\delta \approx 0$, $\sigma^2$ can be estimated as in least-squares:
  $\hat\sigma^2 = \|y - \hat y\|^2 / (n - \mathrm{dof}(\hat y))$;
  otherwise one may consider $y = \mu + \varepsilon$, $\mu = A[f]$, and use methods from nonparametric regression.
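A sketch of generalized cross-validation (GCV) for choosing $\lambda$ in the Tikhonov estimate above; the grid of candidate $\lambda$ values is an arbitrary placeholder.

```python
import numpy as np

def gcv_lambda(A, D, y, lambdas):
    """Pick lambda minimizing GCV(lam) = n ||(I - H_lam) y||^2 / tr(I - H_lam)^2,
    where H_lam = A (A^t A + lam^2 D^t D)^{-1} A^t."""
    n = len(y)
    scores = []
    for lam in lambdas:
        H = A @ np.linalg.solve(A.T @ A + lam**2 * D.T @ D, A.T)
        resid = y - H @ y
        scores.append(n * np.sum(resid**2) / np.trace(np.eye(n) - H) ** 2)
    return lambdas[int(np.argmin(scores))], np.array(scores)

# Example (reusing A, D, y from the deconvolution sketch):
# lam_gcv, scores = gcv_lambda(A, D, y, lambdas=np.logspace(-5, 0, 30))
```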
Resampling methods
- Idea: if $\hat f$ and $\hat\sigma$ are good estimates, then one should be able to generate synthetic data consistent with the actual observations.
- If $\hat f$ is a good estimate, make synthetic data
  $y^*_1 = A[\hat f] + \varepsilon^*_1, \ldots, y^*_b = A[\hat f] + \varepsilon^*_b$
  with $\varepsilon^*_i$ from the same distribution as $\varepsilon$.
- Use the same estimation procedure to get $\hat f^*_i$ from $y^*_i$:
  $y^*_1 \to \hat f^*_1, \ldots, y^*_b \to \hat f^*_b$
- Approximate the distribution of $\hat f$ with that of $\hat f^*_1, \ldots, \hat f^*_b$.
Generating $\varepsilon^*$ similar to $\varepsilon$
- Parametric resampling: $\varepsilon^* \sim N(0, \hat\sigma^2 I)$
- Nonparametric resampling: sample $\varepsilon^*$ with replacement from the corrected residuals
  (both variants are sketched below)
- Problems:
  possibly badly biased and computationally intensive;
  residuals may also need to be corrected for correlation structure;
  a bad $\hat f$ may lead to misleading results.
- Training sets:
  Let $\{f_1, \ldots, f_k\}$ be a training set (e.g., historical data).
  For each $f_j$ use resampling to estimate $\mathrm{MSE}(\hat f_j)$.
  Study the variability of $\mathrm{MSE}(\hat f_j)/\|f_j\|$.
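A sketch of the parametric and nonparametric resampling step for the Tikhonov estimate; the names (A, D, lam, f_hat, sigma_hat, corrected residuals r_c) are placeholders tied to the earlier sketches.

```python
import numpy as np

def resample_estimates(A, D, lam, f_hat, n_boot, sigma_hat=None, r_c=None, seed=0):
    """Generate bootstrap estimates f*_1,...,f*_b.
    Parametric if sigma_hat is given (eps* ~ N(0, sigma_hat^2 I));
    nonparametric if corrected residuals r_c are given (resampled with replacement)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = np.linalg.solve(A.T @ A + lam**2 * D.T @ D, A.T)
    mu_hat = A @ f_hat                      # noise-free synthetic data A f_hat
    f_stars = []
    for _ in range(n_boot):
        if sigma_hat is not None:
            eps = sigma_hat * rng.normal(size=n)
        else:
            eps = rng.choice(r_c, size=n, replace=True)
        f_stars.append(L @ (mu_hat + eps))  # same estimation procedure applied to y*
    return np.array(f_stars)

# Example: pointwise bootstrap standard errors of f_hat
# f_stars = resample_estimates(A, D, lam, f_hat, n_boot=500, sigma_hat=0.01)
# se = f_stars.std(axis=0)
```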
Bayesian inversion
- Hierarchical model (simple Gaussian case):
  $y \mid a, \theta \sim N(Aa + \delta, \sigma^2 I)$, $\quad a \mid \theta \sim F_a(\cdot \mid \theta)$, $\quad \theta \sim F_\theta$, $\quad \theta = (\delta, \sigma, \tau)$
- It all reduces to sampling from the posterior $F(a \mid y)$: use MCMC methods.
- Possible problems:
  selection of priors, improper priors;
  convergence of MCMC;
  computational cost of MCMC in high dimensions;
  interpretation of results.
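A minimal Gibbs-sampler sketch for one concrete instance of such a hierarchical model: $\delta = 0$, a (possibly improper) Gaussian smoothness prior on $a$ with fixed scale $\gamma$, and an inverse-gamma prior on $\sigma^2$. These prior choices and hyperparameters are illustrative assumptions, not the general model on the slide.

```python
import numpy as np

def gibbs_linear_inverse(A, D, y, gamma, n_iter=2000, a0=1e-2, b0=1e-2, seed=0):
    """Gibbs sampler for y | a, sig2 ~ N(A a, sig2 I), a ~ N(0, gamma^2 (D^t D)^{-1}),
    sig2 ~ InvGamma(a0, b0). Returns samples of (a, sig2)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    a, sig2 = np.zeros(m), np.var(y)
    AtA, Aty, DtD = A.T @ A, A.T @ y, D.T @ D
    samples_a, samples_sig2 = [], []
    for _ in range(n_iter):
        # a | sig2, y ~ N(mean, Prec^{-1}),  Prec = AtA/sig2 + DtD/gamma^2
        Prec = AtA / sig2 + DtD / gamma**2
        mean = np.linalg.solve(Prec, Aty / sig2)
        Lc = np.linalg.cholesky(Prec)
        a = mean + np.linalg.solve(Lc.T, rng.normal(size=m))
        # sig2 | a, y ~ InvGamma(a0 + n/2, b0 + ||y - A a||^2 / 2)
        resid = y - A @ a
        sig2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * resid @ resid))
        samples_a.append(a)
        samples_sig2.append(sig2)
    return np.array(samples_a), np.array(samples_sig2)
```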
Warning
- Parameters can be tuned so that the Tikhonov estimate $\hat f$ equals $E(f \mid y)$,
  but this does not mean that the uncertainty of $\hat f$ is obtained from the posterior of $f$ given $y$.
Warning
- Bayesian or frequentist inversions can be used for uncertainty quantification, but:
- uncertainties in each have different interpretations that should be understood;
- both make assumptions (sometimes stringent) that should be revealed;
- the validity of the assumptions should be explored: exploratory analysis and model validation.
Exploratory data analysis (by example)
Example: $\ell_2$ and $\ell_1$ estimates
- data = $y = Af + \varepsilon = AW\beta + \varepsilon$
- $\hat f_{\ell_2} = \arg\min_g \|y - Ag\|_2^2 + \lambda^2 \|Dg\|_2^2$, $\quad \lambda$: from GCV
- $\hat f_{\ell_1} = W\hat\beta$, $\quad \hat\beta = \arg\min_b \|y - AWb\|_2^2 + \lambda \|b\|_1$, $\quad \lambda$: from the discrepancy principle
- $E(\varepsilon) = 0$, $\quad \mathrm{Var}(\varepsilon) = \sigma^2 I$, $\quad \hat\sigma$ = smoothing-spline estimate
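A sketch of one standard way to compute an $\ell_1$-penalized estimate like $\hat\beta$ above, using iterative soft-thresholding (ISTA); the slides do not specify the solver, and the synthesis matrix $W$ (e.g. a wavelet basis), step size and penalty values here are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def ista_l1(M, y, lam, n_iter=500):
    """Minimize ||y - M b||_2^2 + lam ||b||_1 by iterative soft-thresholding (ISTA)."""
    step = 1.0 / (2.0 * np.linalg.norm(M.T @ M, 2))   # 1 / Lipschitz constant of the gradient
    b = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * M.T @ (M @ b - y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Example: with M = A @ W for some synthesis matrix W (e.g. a wavelet basis),
# beta_hat = ista_l1(A @ W, y, lam=0.05);  f_hat_l1 = W @ beta_hat
```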
Example: y, Af & f
Tools: robust simulation summaries
- Sample mean and variance of $x_1, \ldots, x_m$:
  $\bar X = (x_1 + \cdots + x_m)/m$, $\quad S^2 = \frac{1}{m} \sum_i (x_i - \bar X)^2$
- Recursive formulas:
  $\bar X_{n+1} = \bar X_n + \frac{1}{n+1}\,(x_{n+1} - \bar X_n)$
  $T_{n+1} = T_n + \frac{n}{n+1}\,(x_{n+1} - \bar X_n)^2$, $\quad S_n^2 = T_n / n$
- Not robust.
Robust measures
- Median and median absolute deviation from the median (MAD):
  $\tilde X_m = \mathrm{median}\{x_1, \ldots, x_m\}$
  $\mathrm{MAD} = \mathrm{median}\big\{\, |x_1 - \tilde X_m|, \ldots, |x_m - \tilde X_m| \,\big\}$
- Approximate recursive formulas with good asymptotics.
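A sketch implementing the recursive mean/variance update and the robust median/MAD summaries above; here the robust summaries are computed from the stored sample rather than recursively, and the heavy-tailed test sample is a placeholder.

```python
import numpy as np

class RunningMeanVar:
    """Recursive (Welford-style) update of the sample mean and variance."""
    def __init__(self):
        self.n, self.mean, self.T = 0, 0.0, 0.0   # T = running sum of squared deviations
    def update(self, x):
        delta = x - self.mean
        self.mean += delta / (self.n + 1)
        self.T += self.n / (self.n + 1) * delta**2
        self.n += 1
    @property
    def var(self):
        return self.T / self.n if self.n > 0 else np.nan

def median_mad(x):
    """Robust summaries: median and median absolute deviation from the median."""
    med = np.median(x)
    return med, np.median(np.abs(x - med))

# Example on a heavy-tailed sample
rm = RunningMeanVar()
x = np.random.default_rng(3).standard_t(df=2, size=1000)
for xi in x:
    rm.update(xi)
print(rm.mean, rm.var)       # non-robust summaries
print(median_mad(x))         # robust summaries
```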
Stability of $\hat f_{\ell_2}$ (figure)
Stability of $\hat f_{\ell_1}$ (figure)
CI for bias of $A\hat f_{\ell_2}$
- A good estimate of $f$ gives a good estimate of $Af$.
- $y$ is a direct observation of $Af$; it is less sensitive to $\lambda$.
- $1 - \alpha$ CIs for the bias of $A\hat f$ (with $\hat y = Hy$):
  $\dfrac{\hat y_i - y_i}{1 - H_{ii}} \;\pm\; z_{\alpha/2}\, \hat\sigma\, \sqrt{\dfrac{1 - 2H_{ii} + (H^2)_{ii}}{(1 - H_{ii})^2}}$
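A sketch of this bias-CI computation for a linear smoother $\hat y = Hy$, following the interval displayed above; the variance term is reconstructed under the assumption of a symmetric $H$, and the example values are placeholders.

```python
import numpy as np

def bias_ci(H, y, sigma_hat, z=1.96):
    """Pointwise CIs for the bias of y_hat = H y, as on the slide:
    (y_hat_i - y_i)/(1 - H_ii) +/- z * sigma_hat * sqrt(1 - 2 H_ii + (H^2)_ii) / (1 - H_ii)."""
    y_hat = H @ y
    h = np.diag(H)
    h2 = np.diag(H @ H)
    center = (y_hat - y) / (1.0 - h)
    half = z * sigma_hat * np.sqrt(1.0 - 2.0 * h + h2) / (1.0 - h)
    return center - half, center + half

# Example (reusing the hat matrix of the Tikhonov deconvolution sketch):
# H = A @ np.linalg.solve(A.T @ A + lam**2 * D.T @ D, A.T)
# lo, hi = bias_ci(H, y, sigma_hat=0.01)
```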
95% CIs for bias
Coverage of 95% CIs for bias
Bias bounds for $\hat f_{\ell_2}$
- For fixed $\lambda$:
  $\mathrm{Var}(\hat f_{\ell_2}) = \sigma^2\, G(\lambda)^{-1} A^t A\, G(\lambda)^{-1}$
  $\mathrm{Bias}(\hat f_{\ell_2}) = E(\hat f_{\ell_2}) - f = -B(\lambda)$,
  with $G(\lambda) = A^t A + \lambda^2 D^t D$ and $B(\lambda) = \lambda^2\, G(\lambda)^{-1} D^t D f$.
- Median bias: $\mathrm{Bias}_M(\hat f_{\ell_2}) = \mathrm{median}(\hat f_{\ell_2}) - f$
- For $\lambda$ estimated: $\mathrm{Bias}_M(\hat f_{\ell_2}) \approx -\,\mathrm{median}\big[\, B(\hat\lambda)\, \big]$
Hölder's inequality
- $|B(\hat\lambda)_i| \leq \hat\lambda^2\, U_{p,i}(\hat\lambda)\, \|Df\|_q$, $\qquad |B(\hat\lambda)_i| \leq \hat\lambda^2\, U^w_{p,i}(\hat\lambda)\, \|\beta\|_q$
- $U_{p,i}(\hat\lambda) = \|D\, G(\hat\lambda)^{-1} e_i\|_p$ $\quad$ or $\quad$ $U^w_{p,i}(\hat\lambda) = \|W^t D^t D\, G(\hat\lambda)^{-1} e_i\|_p$
- Plots of $U_{p,i}(\hat\lambda)$ and $U^w_{p,i}(\hat\lambda)$ vs $x_i$
Assessing fit
- Compare $y$ to $A\hat f + \varepsilon$.
- Parametric/nonparametric bootstrap samples $y^*$:
  (i) (parametric) $\varepsilon^*_i \sim N(0, \hat\sigma^2 I)$, or (nonparametric) $\varepsilon^*_i$ from the corrected residuals $r_c$;
  (ii) $y^*_i = A\hat f + \varepsilon^*_i$
- Compare statistics of $y$ and $\{y^*_1, \ldots, y^*_B\}$ (blue, red).
- Compare statistics of $r_c$ and $N(0, \hat\sigma^2 I)$ (black).
- Compare parametric vs nonparametric results.
- Example: simulations for $A\hat f_{\ell_2}$ with: $y^{(1)}$ good, $y^{(2)}$ with skewed noise, $y^{(3)}$ biased.
Example
- $T_1(y) = \min(y)$, $\quad T_2(y) = \max(y)$
- $T_k(y) = \frac{1}{n} \sum_{i=1}^n \big( y_i - \mathrm{mean}(y) \big)^k / \mathrm{std}(y)^k$ $\;$ ($k = 3, 4$)
- $T_5(y)$ = sample median$(y)$, $\quad T_6(y) = \mathrm{MAD}(y)$
- $T_7(y)$ = 1st sample quartile$(y)$, $\quad T_8(y)$ = 3rd sample quartile$(y)$
- $T_9(y)$ = number of runs above/below median$(y)$
- $P_j = P\big( T_j(y^*) \geq T_j^o \big)$, $\quad T_j^o = T_j(y)$
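A sketch of the bootstrap tail probabilities $P_j$ for a few of these statistics, assuming bootstrap replicates $y^*_1, \ldots, y^*_B$ have already been generated as in the earlier resampling sketch; the selection of statistics shown is only a subset of the nine on the slide.

```python
import numpy as np

def skewness(y):
    return np.mean((y - y.mean())**3) / y.std()**3

def n_runs_about_median(y):
    signs = y > np.median(y)
    return 1 + np.sum(signs[1:] != signs[:-1])

STATS = {
    "min": np.min,
    "max": np.max,
    "skewness": skewness,
    "MAD": lambda y: np.median(np.abs(y - np.median(y))),
    "runs": n_runs_about_median,
}

def bootstrap_pvalues(y, y_stars):
    """P_j = fraction of bootstrap replicates with T_j(y*) >= T_j(y)."""
    return {name: np.mean([T(ys) >= T(y) for ys in y_stars])
            for name, T in STATS.items()}

# Example: y_stars = [A @ f_hat + eps_star for each bootstrap draw]
# pvals = bootstrap_pvalues(y, y_stars)
```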
Assessing fit: Bayesian case
- Hierarchical model:
  $y \mid f, \sigma, \gamma \sim N(Af, \sigma^2 I)$
  $f \mid \gamma \propto \exp\big( -f^t D^t D f / 2\gamma^2 \big)$
  $(\sigma, \gamma) \sim \pi$
  with $E(f \mid y) = \Big( A^t A + \frac{\sigma^2}{\gamma^2} D^t D \Big)^{-1} A^t y$
- Posterior mean = $\hat f_{\ell_2}$ for $\gamma = \sigma/\lambda$.
- Sample $(f^*, \sigma^*)$ from the posterior of $(f, \sigma)$ given $y$ and simulate data $y^* = Af^* + \sigma^* \varepsilon$ with $\varepsilon \sim N(0, I)$.
- Explore the consistency of $y$ with the simulated $y^*$.
Simulating $f$
- Modify the simulations from $y^*_i = A\hat f + \varepsilon^*_i$ to $y^*_i = A f^*_i + \varepsilon^*_i$, with $f^*_i$ similar to $f$.
- Frequentist simulation of $f$? One option: take the $f^*_i$ from historical data (a training set).
- Check the relative MSE for the different $f^*_i$.
Bayesian or frequentist?
- Results have different interpretations.
- Frequentists can use Bayesian methods to derive procedures; Bayesians can use frequentist methods to evaluate procedures.
- Consistent results for small problems and lots of data.
- Possibly very different results for complex, large-scale problems.
- Theorems do not address the relation of theory to reality.
Readings

Basic Bayesian methods:
- Carlin, BP & Louis, TA, 2008, Bayesian Methods for Data Analysis (3rd edn.), Chapman & Hall
- Gelman, A, Carlin, JB, Stern, HS & Rubin, DB, 2003, Bayesian Data Analysis (2nd edn.), Chapman & Hall

Bayesian methods for inverse problems:
- Calvetti, D & Somersalo, E, 2007, Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing, Springer
- Kaipio, J & Somersalo, E, 2004, Statistical and Computational Inverse Problems, Springer
- Stuart, AM, 2010, Inverse Problems: A Bayesian Perspective, Acta Numerica, DOI:10.1017/S0962492904

Discussion of frequentist vs Bayesian statistics:
- Freedman, D, 1995, Some issues in the foundation of statistics, Foundations of Science, 1, 19-39

Tutorial on frequentist and Bayesian methods for inverse problems:
- Stark, PB & Tenorio, L, 2011, A primer of frequentist and Bayesian inference in inverse problems. In Computational Methods for Large-Scale Inverse Problems and Quantification of Uncertainty, pp. 9-32, Wiley

Statistics and inverse problems:
- Evans, SN & Stark, PB, 2003, Inverse problems as statistics, Inverse Problems, 18, R55
- Marzouk, Y & Tenorio, L, 2012, Uncertainty Quantification for Inverse Problems
- O'Sullivan, F, 1986, A statistical perspective on ill-posed inverse problems, Statistical Science, 1, 502-527
- Tenorio, L, Andersson, F, de Hoop, M & Ma, P, 2011, Data analysis tools for uncertainty quantification of inverse problems, Inverse Problems, 27, 045001
- Wahba, G, 1990, Spline Models for Observational Data, SIAM
- Special section in Inverse Problems: Volume 24, Number 3, June 2008

Mathematical statistics:
- Casella, G & Berger, RL, 2001, Statistical Inference, Duxbury Press
- Schervish, ML, 1996, Theory of Statistics, Springer