Statistical Methods for Image Reconstruction


1 Statistical Methods for Image Reconstruction Jeffrey A. Fessler EECS Department The University of Michigan NSS-MIC Oct. 19,

2 Image Reconstruction Methods (Simplified View) Analytical (FBP) Iterative (OSEM?) 0.1

3 Image Reconstruction Methods / Algorithms
ANALYTICAL: FBP, BPF, Gridding, ...
ITERATIVE:
  Algebraic (y = Ax): ART, MART, SMART, ...
  (Weighted) Least Squares: CG, CD, ISRA, ...
  Statistical Likelihood (e.g., Poisson): EM (etc.), OSEM, SAGE, CG, Int. Point, GCA, PSCD, FSCD

4 Outline
Part 0: Introduction / Overview / Examples
Part 1: From Physics to Statistics (Emission tomography)
  Assumptions underlying Poisson statistical model
  Emission reconstruction problem statement
Part 2: Four of Five Choices for Statistical Image Reconstruction
  Object parameterization
  System physical modeling
  Statistical modeling of measurements
  Cost functions and regularization
Part 3: Fifth Choice: Iterative algorithms
  Classical optimization methods
  Considerations: nonnegativity, convergence rate, ...
  Optimization transfer: EM etc.
  Ordered subsets / block iterative / incremental gradient methods
Part 4: Performance Analysis
  Spatial resolution properties
  Noise properties
  Detection performance
Part 5: Miscellaneous topics (?)

5 History
Successive substitution method vs direct Fourier (Bracewell, 1956)
Iterative method for X-ray CT (Hounsfield, 1968)
ART for tomography (Gordon, Bender, Herman, JTB, 1970)
Richardson/Lucy iteration for image restoration (1972, 1974)
Weighted least squares for 3D SPECT (Goitein, NIM, 1972)
Proposals to use Poisson likelihood for emission and transmission tomography
  Emission: (Rockmore and Macovski, TNS, 1976)
  Transmission: (Rockmore and Macovski, TNS, 1977)
Expectation-maximization (EM) algorithms for Poisson model
  Emission: (Shepp and Vardi, TMI, 1982)
  Transmission: (Lange and Carson, JCAT, 1984)
Regularized (aka Bayesian) Poisson emission reconstruction (Geman and McClure, ASA, 1985)
Ordered-subsets EM algorithm (Hudson and Larkin, TMI, 1994)
Commercial introduction of OSEM for PET scanners circa

6 Why Statistical Methods? Object constraints (e.g., nonnegativity, object support) Accurate physical models (less bias = improved quantitative accuracy) (e.g., nonuniform attenuation in SPECT) improved spatial resolution? Appropriate statistical models (less variance = lower image noise) (FBP treats all rays equally) Side information (e.g., MRI or CT boundaries) Nonstandard geometries (e.g., irregular sampling or missing data) Disadvantages? Computation time Model complexity Software complexity Analytical methods (a different short course!) Idealized mathematical model Usually geometry only, greatly over-simplified physics Continuum measurements (discretize/sample after solving) No statistical model Easier analysis of properties (due to linearity) e.g., Huesman (1984) FBP ROI variance for kinetic fitting 0.5

7 What about Moore s Law? 0.6

8 Benefit Example: Statistical Models
[Figure: True, FBP, PWLS, and PL reconstructions of soft tissue and cortical bone slices.]
NRMS Error:
  Method    Soft Tissue    Cortical Bone
  FBP       22.7%          29.6%
  PWLS      13.6%          16.2%
  PL        11.8%          15.8%
0.7

9 Benefit Example: Physical Models
[Figure, two columns: a. True object / a. Soft tissue corrected FBP; b. Uncorrected FBP / b. JS corrected FBP; c. Monoenergetic statistical reconstruction / c. Polyenergetic statistical reconstruction]

10 Benefit Example: Nonstandard Geometries 0.9 Photon Source Detector Bins

11 Truncated Fan-Beam SPECT Transmission Scan Truncated Truncated Untruncated FBP PWLS FBP 0.10

12 One Final Advertisement: Iterative MR Reconstruction 0.11

13 Part 1: From Physics to Statistics or What quantity is reconstructed? (in emission tomography) Outline Decay phenomena and fundamental assumptions Idealized detectors Random phenomena Poisson measurement statistics State emission tomography reconstruction problem 1.1

14 What Object is Reconstructed? In emission imaging, our aim is to image the radiotracer distribution. The what? At time t = 0 we inject the patient with some radiotracer, containing a large number N of metastable atoms of some radionuclide. Let X k (t) R 3 denote the position of the kth tracer atom at time t. These positions are influenced by blood flow, patient physiology, and other unpredictable phenomena such as Brownian motion. The ultimate imaging device would provide an exact list of the spatial locations X 1 (t),..., X N (t) of all tracer atoms for the entire scan. Would this be enough? 1.2

15 Atom Positions or Statistical Distribution?
Repeating a scan would yield different tracer atom sample paths {X_k(t)}_{k=1}^N ⇒ statistical formulation
Assumption 1. The spatial locations of individual tracer atoms at any time t ≥ 0 are independent random variables that are all identically distributed according to a common probability density function (pdf) p_t(x).
This pdf is determined by patient physiology and tracer properties.
Larger values of p_t(x) correspond to "hot spots" where the tracer atoms tend to be located at time t. Units: inverse volume, e.g., atoms per cubic centimeter.
The radiotracer distribution p_t(x) is the quantity of interest. (Not {X_k(t)}_{k=1}^N!) 1.3

16 Example: Perfect Detector
[Figure: true radiotracer distribution p_t(x) at some time t (left); a realization of N = 2000 i.i.d. atom positions recorded exactly (right, dots). Little similarity!] 1.4

17 Binning/Histogram Density Estimator
[Figure: estimate of p_t(x) formed by histogram binning of N = 2000 points. Ramp remains difficult to visualize.] 1.5

18 Kernel Density Estimator
[Figure: Gaussian kernel density estimator for p_t(x) from N = 2000 points; horizontal profiles at x_2 = 3 through the true density, the binned estimate, and the kernel estimate.] 1.6

19 Poisson Spatial Point Process
Assumption 2. The number of administered tracer atoms N has a Poisson distribution with some mean
  µ_N ≜ E[N] = Σ_{n=0}^∞ n P{N = n}.
Let N_t(B) denote the number of tracer atoms that have spatial locations in any set B ⊂ R^3 (VOI) at time t after injection. N_t(·) is called a Poisson spatial point process.
Fact. For any set B, N_t(B) is Poisson distributed with mean:
  E[N_t(B)] = E[N] P{X_k(t) ∈ B} = µ_N ∫_B p_t(x) dx.
Poisson N injected atoms + i.i.d. locations ⇒ Poisson point process 1.7

20 Illustration of Point Process (µ N = 200) 25 points within ROI 15 points within ROI x x x 1 20 points within ROI x 1 26 points within ROI x x x x 1 1.8

21 Radionuclide Decay
Preceding quantities are all unobservable. We observe a tracer atom only when it decays and emits photon(s).
The time that the kth tracer atom decays is a random variable T_k.
Assumption 3. The T_k's are statistically independent random variables, and are independent of the (random) spatial location.
Assumption 4. Each T_k has an exponential distribution with mean µ_T = t_{1/2} / ln 2.
Cumulative distribution function: P{T_k ≤ t} = 1 − exp(−t / µ_T) 1.9

22 Statistics of an Ideal Decay Counter
Let K_t(B) denote the number of tracer atoms that decay by time t, and that were located in the VOI B ⊂ R^3 at the time of decay.
Fact. K_t(B) is a Poisson counting process with mean
  E[K_t(B)] = ∫_0^t ∫_B λ(x, τ) dx dτ,
where the (nonuniform) emission rate density is given by
  λ(x, t) ≜ µ_N (e^{−t/µ_T} / µ_T) p_t(x).    Ingredients: dose, decay, distribution
Units: counts per unit time per unit volume, e.g., µCi/cc.
Photon emission is a Poisson process. What about the actual measurement statistics? 1.10

23 Idealized Detector Units
A nuclear imaging system consists of n_d conceptual detector units.
Assumption 5. Each decay of a tracer atom produces a recorded count in at most one detector unit.
Let S_k ∈ {0, 1, ..., n_d} denote the index of the incremented detector unit for decay of the kth tracer atom. (S_k = 0 if decay is undetected.)
Assumption 6. The S_k's satisfy the following conditional independence:
  P{S_1, ..., S_N | N, T_1, ..., T_N, X_1(·), ..., X_N(·)} = Π_{k=1}^N P{S_k | X_k(T_k)}.
The recorded bin for the kth tracer atom's decay depends only on its position when it decays, and is independent of all other tracer atoms. (No event pileup; no deadtime losses.) 1.11

24 PET Example i = 1 Sinogram Ray i Angular Positions Radial Positions i = n d n d (n crystals 1) n crystals /2 1.12

25 SPECT Example i = 1 Sinogram Radial Positions 1.13 Angular Positions nd = nradial bins nangular steps i = nd Collimator / Detector

26 Detector Unit Sensitivity Patterns
Spatial localization: s_i(x) ≜ probability that a decay at x is recorded by the ith detector unit.
Idealized Example. Shift-invariant PSF: s_i(x) = h(⟨k_i, x⟩ − r_i)
  r_i is the radial position of the ith ray
  k_i is the unit vector orthogonal to the ith parallel ray
  h(·) is the shift-invariant radial PSF (e.g., Gaussian bell or rectangular function)

27 Example: SPECT Detector-Unit Sensitivity Patterns s 1 ( x) s 2 ( x) x 2 x 1 Two representative s i ( x) functions for a collimated Anger camera. 1.15

28 Example: PET Detector-Unit Sensitivity Patterns

29 Detector Unit Sensitivity Patterns
s_i(x) can include the effects of: geometry / solid angle, collimation, scatter, attenuation, detector response / scan geometry, duty cycle (dwell time at each angle), detector efficiency, positron range, noncollinearity, ...
System sensitivity pattern:
  s(x) ≜ Σ_{i=1}^{n_d} s_i(x) = 1 − s_0(x) ≤ 1
(probability that a decay at location x will be detected at all by the system) 1.17

30 (Overall) System Sensitivity Pattern: s( x) = n d i=1 s i( x) x 2 x 1 Example: collimated 180 SPECT system with uniform attenuation. 1.18

31 Detection Probabilities s i ( x 0 ) (vs det. unit index i) s i ( x 0 ) θ x 2 x 0 x 1 Image domain r Sinogram domain 1.19

32 Summary of Random Phenomena
Number of tracer atoms injected: N
Spatial locations of tracer atoms: {X_k(t)}_{k=1}^N
Time of decay of tracer atoms: {T_k}_{k=1}^N
Detection of photon: [S_k ≠ 0]
Recording detector unit: {S_k}_{k=1}^N 1.20

33 Emission Scan
Record events in each detector unit for t_1 ≤ t ≤ t_2.
Y_i ≜ number of events recorded by the ith detector unit during the scan, for i = 1, ..., n_d:
  Y_i ≜ Σ_{k=1}^N 1{S_k = i, T_k ∈ [t_1, t_2]}.
The collection {Y_i : i = 1, ..., n_d} is our sinogram. Note 0 ≤ Y_i ≤ N.
Fact. Under Assumptions 1-6 above,
  Y_i ~ Poisson{ ∫ s_i(x) λ(x) dx }    (cf. "line integral")
and the Y_i's are statistically independent random variables, where the emission density is given by
  λ(x) ≜ µ_N ∫_{t_1}^{t_2} (1/µ_T) e^{−t/µ_T} p_t(x) dt.
(Local number of decays per unit volume during the scan.)
Ingredients: dose (injected), duration of scan, decay of radionuclide, distribution of radiotracer 1.21

34 Poisson Statistical Model (Emission)
Actual measured counts = foreground counts + background counts.
Sources of background counts: cosmic radiation / room background, random coincidences (PET), scatter not accounted for in s_i(x), crosstalk from transmission sources in simultaneous T/E scans, anything else not accounted for by ∫ s_i(x) λ(x) dx
Assumption 7. The background counts also have independent Poisson distributions.
Statistical model (continuous to discrete):
  Y_i ~ Poisson{ ∫ s_i(x) λ(x) dx + r_i },  i = 1, ..., n_d
r_i : mean number of background counts recorded by the ith detector unit. 1.22

35 Emission Reconstruction Problem
Estimate the emission density λ(x) using (something like) this model:
  Y_i ~ Poisson{ ∫ s_i(x) λ(x) dx + r_i },  i = 1, ..., n_d.
Knowns:
  {Y_i = y_i}_{i=1}^{n_d} : observed counts from each detector unit
  s_i(x) : sensitivity patterns (determined by system models)
  r_i's : background contributions (determined separately)
Unknown: λ(x) 1.23

36 List-mode acquisitions
Recall that the conventional sinogram is temporally binned:
  Y_i ≜ Σ_{k=1}^N 1{S_k = i, T_k ∈ [t_1, t_2]}.
This binning discards temporal information.
List-mode measurements: record all (detector, time) pairs in a list, i.e., {(S_k, T_k) : k = 1, ..., N}.
List-mode dynamic reconstruction problem: estimate λ(x, t) given {(S_k, T_k)}. 1.24

37 Emission Reconstruction Problem - Illustration λ( x) {Y i } x 2 θ x 1 r 1.25

38 Example: MRI Sensitivity Pattern
Each k-space sample corresponds to a sinusoidal pattern weighted by: RF receive coil sensitivity pattern, phase effects of field inhomogeneity, spin relaxation effects.
  y_i = ∫ f(x) s_i(x) dx + ε_i,  s_i(x) = c_RF(x) exp(−ı ω(x) t_i) exp(−t_i / T2(x)) exp(−ı 2π k(t_i) · x) 1.26

39 Continuous-Discrete Models
Emission tomography: y_i ~ Poisson{ ∫ λ(x) s_i(x) dx + r_i }
Transmission tomography (monoenergetic): y_i ~ Poisson{ b_i exp(−∫_{L_i} µ(x) dl) + r_i }
Transmission (polyenergetic): y_i ~ Poisson{ ∫ I_i(E) exp(−∫_{L_i} µ(x, E) dl) dE + r_i }
Magnetic resonance imaging: y_i = ∫ f(x) s_i(x) dx + ε_i
Discrete measurements y = (y_1, ..., y_{n_d}). Continuous-space unknowns: λ(x), µ(x), f(x). Goal: estimate f(x) given y.
Solution options:
  Continuous-continuous formulations ("analytical")
  Continuous-discrete formulations: usually f̂(x) = Σ_{i=1}^{n_d} c_i s_i(x)
  Discrete-discrete formulations: f(x) ≈ Σ_{j=1}^{n_p} x_j b_j(x) 1.27

40 Part 2: Five Categories of Choices Object parameterization: function f( r) vs finite coefficient vector x System physical model: {s i ( r)} Measurement statistical model y i? Cost function: data-mismatch and regularization Algorithm / initialization No perfect choices - one can critique all approaches! 2.1

41 Choice 1. Object Parameterization
Finite measurements: {y_i}_{i=1}^{n_d}. Continuous object: f(r). Hopeless?
"All models are wrong but some models are useful."
Linear series expansion approach. Replace f(r) by x = (x_1, ..., x_{n_p}) where
  f(r) ≈ f̃(r) = Σ_{j=1}^{n_p} x_j b_j(r)    ("basis functions")
Forward projection:
  ∫ s_i(r) f̃(r) dr = ∫ s_i(r) [Σ_{j=1}^{n_p} x_j b_j(r)] dr = Σ_{j=1}^{n_p} [∫ s_i(r) b_j(r) dr] x_j
                    = Σ_{j=1}^{n_p} a_ij x_j = [Ax]_i,  where a_ij ≜ ∫ s_i(r) b_j(r) dr
Projection integrals become finite summations.
a_ij is the contribution of the jth basis function (e.g., voxel) to the ith detector unit.
The units of a_ij and x_j depend on the user-selected units of b_j(r).
The n_d × n_p matrix A = {a_ij} is called the system matrix. 2.2

42 (Linear) Basis Function Choices
Fourier series (complex / not sparse); circular harmonics (complex / not sparse); wavelets (negative values / not sparse); Kaiser-Bessel window functions (blobs); overlapping circles (disks) or spheres (balls); polar grids, logarithmic polar grids; "natural pixels" {s_i(r)}; B-splines (pyramids); rectangular pixels / voxels (rect functions); point masses / bed-of-nails / lattice of points / comb function; organ-based voxels (e.g., from CT), ...
Considerations
  Represent f(r) well with moderate n_p
  Orthogonality? (not essential)
  Easy to compute a_ij's and/or Ax
  Rotational symmetry
  If stored, the system matrix A should be sparse (mostly zeros).
  Easy to represent nonnegative functions, e.g., if x_j ≥ 0, then f(r) ≥ 0. A sufficient condition is b_j(r) ≥ 0. 2.3

43 Nonlinear Object Parameterizations Estimation of intensity and shape (e.g., location, radius, etc.) Surface-based (homogeneous) models Circles / spheres Ellipses / ellipsoids Superquadrics Polygons Bi-quadratic triangular Bezier patches,... Other models Generalized series f( r) = j x j b j ( r, θ) Deformable templates f( r) = b(t θ ( r))... Considerations Can be considerably more parsimonious If correct, yield greatly reduced estimation error Particularly compelling in limited-data problems Often oversimplified (all models are wrong but...) Nonlinear dependence on location induces non-convex cost functions, complicating optimization 2.4

44 Example Basis Functions - 1D 4 Continuous object f( r) Piecewise Constant Approximation Quadratic B Spline Approximation x 2.5

45 Pixel Basis Functions - 2D µ 0 (x,y) x x Continuous image f( r) Pixel basis approximation n p j=1 x j b j ( r) 2.6

46 Blobs in SPECT: Qualitative Post filt. OSEM (3 pix. FWHM) blob based α= Post filt. OSEM (3 pix. FWHM) rotation based Post filt. OSEM (3 pix. FWHM) blob based α= ˆxR ˆx ˆx B0 B1 x mm (2D SPECT thorax phantom simulations) 2.7

47 Blobs in SPECT: Quantitative Standard deviation vs. bias in reconstructed phantom images Per iteration, rotation based Per iteration, blob based α=10.4 Per iteration, blob based α=0 Per FWHM, rotation based Per FWHM, blob based α=10.4 Per FWHM, blob based α=0 FBP 2 Standard deviation (%) Bias (%) 2.8

48 Discrete-Discrete Emission Reconstruction Problem
Having chosen a basis and linearly parameterized the emission density ...
Estimate the emission density coefficient vector x = (x_1, ..., x_{n_p}) (aka "image") using (something like) this statistical model:
  y_i ~ Poisson{ Σ_{j=1}^{n_p} a_ij x_j + r_i },  i = 1, ..., n_d.
  {y_i}_{i=1}^{n_d} : observed counts from each detector unit
  A = {a_ij} : system matrix (determined by system models)
  r_i's : background contributions (determined separately)
Many image reconstruction problems are "find x given y" where
  y_i = g_i([Ax]_i) + ε_i,  i = 1, ..., n_d. 2.9
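To make the discrete-discrete model concrete, here is a minimal NumPy sketch (not from the course) that builds a small random sparse stand-in for the system matrix A, a nonnegative coefficient vector x, and draws a Poisson sinogram y_i ~ Poisson{[Ax]_i + r_i}. The sizes, the way A is generated, and the background level are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_p, n_d = 64, 100           # toy sizes (assumption, not from the course)
# Toy nonnegative sparse system matrix A (stand-in for real geometric a_ij)
A = rng.random((n_d, n_p)) * (rng.random((n_d, n_p)) < 0.1)
x_true = rng.gamma(shape=2.0, scale=1.0, size=n_p)   # nonnegative "image"
r = np.full(n_d, 0.5)        # mean background counts per detector unit

ybar = A @ x_true + r        # noiseless mean sinogram [Ax]_i + r_i
y = rng.poisson(ybar)        # measured counts y_i ~ Poisson{ybar_i}

print("total mean counts:", ybar.sum(), " total measured:", y.sum())
```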

49 Choice 2. System Model
System matrix elements: a_ij = ∫ s_i(r) b_j(r) dr, which can include scan geometry, collimator/detector response, attenuation, scatter (object, collimator, scintillator), duty cycle (dwell time at each angle), detector efficiency / dead-time losses, positron range, noncollinearity, crystal penetration, ...
Considerations
  Improving the system model can improve quantitative accuracy, spatial resolution, contrast, SNR, detectability
  Computation time (and storage vs compute-on-the-fly)
  Model uncertainties (e.g., calculated scatter probabilities based on noisy attenuation map)
  Artifacts due to over-simplifications 2.10

50 Measured System Model? Determine a i j s by scanning a voxel-sized cube source over the imaging volume and recording counts in all detector units (separately for each voxel). Avoids mathematical model approximations Scatter / attenuation added later (object dependent), approximately Small probabilities = long scan times Storage Repeat for every voxel size of interest Repeat if detectors change 2.11

51 Line Length System Model ith ray x 1 x 2 a i j length of intersection 2.12

52 Strip Area System Model ith ray x 1 x j 1 a i j area 2.13

53 (Implicit) System Sensitivity Patterns
  Σ_{i=1}^{n_d} a_ij ≈ s(r_j) = Σ_{i=1}^{n_d} s_i(r_j)
[Figure: implied sensitivity patterns for the line-length and strip-area models.] 2.14

54 Point-Lattice Projector/Backprojector x 1 x 2 ith ray a i j s determined by linear interpolation 2.15

55 Point-Lattice Artifacts Projections (sinograms) of uniform disk object: 0 45 θ r r Point Lattice Strip Area 2.16

56 Forward- / Back-projector Pairs
Forward projection (image domain to projection domain):
  ȳ_i = ∫ s_i(r) f̃(r) dr = Σ_{j=1}^{n_p} a_ij x_j = [Ax]_i,  or  ȳ = Ax
Backprojection (projection domain to image domain):
  A′y = { Σ_{i=1}^{n_d} a_ij y_i }_{j=1}^{n_p}
The term "forward/backprojection pair" often corresponds to an implicit choice for the object basis and the system model.
Often A′y is implemented as By for some backprojector B ≠ A′.
Least-squares solutions (for example):
  x̂ = [A′A]^{−1} A′y ≠ [BA]^{−1} By 2.17

57 Mismatched Backprojector B A x ˆx (PWLS-CG) ˆx (PWLS-CG) Matched Mismatched 2.18

58 Horizontal Profiles ˆf(x1,32) Matched Mismatched x

59 System Model Tricks
Factorize (e.g., PET Gaussian detector response): A ≈ SG (geometric projection followed by Gaussian smoothing)
Symmetry
Rotate and sum
Gaussian diffusion for SPECT Gaussian detector response
Correlated Monte Carlo (Beekman et al.)
In all cases, consistency of the backprojector with A requires care. 2.20

60 SPECT System Model Collimator / Detector Complications: nonuniform attenuation, depth-dependent PSF, Compton scatter 2.21

61 Choice 3. Statistical Models
After modeling the system physics, we have a deterministic "model": y_i ≈ g_i([Ax]_i) for some functions g_i, e.g., g_i(l) = l + r_i for emission tomography.
Statistical modeling is concerned with the "≈" aspect.
Considerations
  More accurate models: can lead to lower variance images, may incur additional computation, may involve additional algorithm complexity (e.g., proper transmission Poisson model has nonconcave log-likelihood)
  Statistical model errors (e.g., deadtime)
  Incorrect models (e.g., log-processed transmission data) 2.22

62 Statistical Model Choices for Emission Tomography
"None." Assume y − r = Ax. Solve algebraically to find x.
White Gaussian noise. Ordinary least squares: minimize ‖y − Ax‖²
Non-white Gaussian noise. Weighted least squares: minimize
  ‖y − Ax‖²_W = Σ_{i=1}^{n_d} w_i (y_i − [Ax]_i)²,  where [Ax]_i ≜ Σ_{j=1}^{n_p} a_ij x_j
  (e.g., for Fourier rebinned (FORE) PET data)
Ordinary Poisson model (ignoring or precorrecting for background): y_i ~ Poisson{[Ax]_i}
Poisson model: y_i ~ Poisson{[Ax]_i + r_i}
Shifted Poisson model (for randoms-precorrected PET): y_i = y_i^prompt − y_i^delay, modeled via y_i + 2r_i ~ Poisson{[Ax]_i + 2r_i} 2.23

63 Shifted Poisson Model for PET
Precorrected random coincidences: y_i = y_i^prompt − y_i^delay
  y_i^prompt ~ Poisson{[Ax]_i + r_i},  y_i^delay ~ Poisson{r_i}
  E[y_i] = [Ax]_i,  Var{y_i} = [Ax]_i + 2r_i
Mean ≠ Variance ⇒ not Poisson!
Statistical model choices
  Ordinary Poisson model (ignore randoms): [y_i]_+ ~ Poisson{[Ax]_i}. Causes bias due to truncated negatives.
  Data-weighted least-squares (Gaussian model): y_i ~ N([Ax]_i, σ̂_i²), σ̂_i² = max(y_i + 2r̂_i, σ²_min). Causes bias due to data-weighting.
  Shifted Poisson model (matches 2 moments): [y_i + 2r̂_i]_+ ~ Poisson{[Ax]_i + 2r̂_i}. Insensitive to inaccuracies in r̂_i.
One can further reduce bias by retaining negative values of y_i + 2r̂_i. 2.24
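A small Monte Carlo sketch of the moment matching behind the shifted Poisson model; the per-ray means ax_i and r_i and the number of trials are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy per-ray means (assumptions for illustration)
ax_i, r_i, n_trials = 8.0, 3.0, 200_000

y_prompt = rng.poisson(ax_i + r_i, n_trials)
y_delay  = rng.poisson(r_i, n_trials)
y = y_prompt - y_delay                   # randoms-precorrected data

print("mean(y) =", y.mean(), " (model predicts [Ax]_i =", ax_i, ")")
print("var(y)  =", y.var(),  " (model predicts [Ax]_i + 2 r_i =", ax_i + 2*r_i, ")")

# Shifted-Poisson surrogate: y + 2r has matching first two moments
z = y + 2 * r_i
print("mean(z), var(z) =", z.mean(), z.var(), " vs Poisson mean", ax_i + 2*r_i)
```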

64 Shifted-Poisson Model for X-ray CT
A model that includes both photon variability and electronic readout noise:
  y_i ~ Poisson{ȳ_i(µ)} + N(0, σ²)
Shifted Poisson approximation: [y_i + σ²]_+ ~ Poisson{ȳ_i(µ) + σ²}, or just use WLS ...
Complications: intractability of the likelihood for Poisson + Gaussian; compound Poisson distribution due to photon-energy-dependent detector signal.
X-ray statistical models are a current research area in several groups! 2.25

65 Choice 4. Cost Functions
Components: data-mismatch term; regularization term (and regularization parameter β); constraints (e.g., nonnegativity).
  Ψ(x) = DataMismatch(y, Ax) + β Roughness(x)
  x̂ ≜ argmin_{x ≥ 0} Ψ(x)
Actually several sub-choices to make for Choice 4 ...
Distinguishes "statistical methods" from "algebraic methods" for "y = Ax". 2.26

66 Why Cost Functions? (vs procedure e.g., adaptive neural net with wavelet denoising) Theoretical reasons ML is based on minimizing a cost function: the negative log-likelihood ML is asymptotically consistent ML is asymptotically unbiased ML is asymptotically efficient (under true statistical model...) Estimation: Penalized-likelihood achieves uniform CR bound asymptotically Detection: Qi and Huesman showed analytically that MAP reconstruction outperforms FBP for SKE/BKE lesion detection (T-MI, Aug. 2001) Practical reasons Stability of estimates (if Ψ and algorithm chosen properly) Predictability of properties (despite nonlinearities) Empirical evidence (?) 2.27

67 Bayesian Framework
Given a prior distribution p(x) for image vectors x, by Bayes' rule:
  posterior: p(x|y) = p(y|x) p(x) / p(y)
so
  log p(x|y) = log p(y|x) + log p(x) − log p(y)
log p(y|x) corresponds to the data-mismatch term (likelihood); log p(x) corresponds to the regularizing penalty function.
Maximum a posteriori (MAP) estimator: x̂ = argmax_x log p(x|y)
Has certain optimality properties (provided p(y|x) and p(x) are correct). Same form as Ψ. 2.28

68 Choice 4.1: Data-Mismatch Term
Options (for emission tomography):
  Negative log-likelihood of statistical model. Poisson emission case:
    −L(x; y) = −log p(y|x) = Σ_{i=1}^{n_d} ([Ax]_i + r_i) − y_i log([Ax]_i + r_i) + log y_i!
  Ordinary (unweighted) least squares: Σ_{i=1}^{n_d} (1/2)(y_i − r̂_i − [Ax]_i)²
  Data-weighted least squares: Σ_{i=1}^{n_d} (1/2)(y_i − r̂_i − [Ax]_i)² / σ̂_i²,  σ̂_i² = max(y_i + r̂_i, σ²_min)  (causes bias due to data-weighting).
  Reweighted least-squares: σ̂_i² = [Ax̂]_i + r̂_i
  Model-weighted least-squares (nonquadratic, but convex!): Σ_{i=1}^{n_d} (1/2)(y_i − r̂_i − [Ax]_i)² / ([Ax]_i + r̂_i)
  Nonquadratic cost-functions that are robust to outliers
  ...
Considerations
  Faithfulness to statistical model vs computation
  Ease of optimization (convex?, quadratic?)
  Effect of statistical modeling errors 2.29
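A minimal sketch of evaluating the Poisson negative log-likelihood data-mismatch term for the discrete model, dropping the constant log y_i! term; the function name and the tiny made-up data are assumptions for illustration.

```python
import numpy as np

def poisson_nll(x, A, y, r):
    """Negative Poisson log-likelihood sum_i (ybar_i - y_i log ybar_i),
    with ybar = A x + r; the constant log(y_i!) term is dropped."""
    ybar = A @ x + r
    return float(np.sum(ybar - y * np.log(ybar)))

# Tiny usage example with made-up numbers (illustrative only)
A = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.7]])
x = np.array([2.0, 3.0])
r = np.array([0.1, 0.1, 0.1])
y = np.array([4, 3, 5])
print(poisson_nll(x, A, y, r))
```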

69 Choice 4.2: Regularization Forcing too much data fit gives noisy images Ill-conditioned problems: small data noise causes large image noise Solutions: Noise-reduction methods True regularization methods Noise-reduction methods Modify the data Prefilter or denoise the sinogram measurements Extrapolate missing (e.g., truncated) data Modify an algorithm derived for an ill-conditioned problem Stop algorithm before convergence Run to convergence, post-filter Toss in a filtering step every iteration or couple iterations Modify update to dampen high-spatial frequencies 2.30

70 Noise-Reduction vs True Regularization Advantages of noise-reduction methods Simplicity (?) Familiarity Appear less subjective than using penalty functions or priors Only fiddle factors are # of iterations, or amount of smoothing Resolution/noise tradeoff usually varies with iteration (stop when image looks good - in principle) Changing post-smoothing does not require re-iterating Advantages of true regularization methods Stability (unique minimizer & convergence = initialization independence) Faster convergence Predictability Resolution can be made object independent Controlled resolution (e.g., spatially uniform, edge preserving) Start with decent image (e.g., FBP) = reach solution faster. 2.31

71 True Regularization Methods Redefine the problem to eliminate ill-conditioning, rather than patching the data or algorithm! Options Use bigger pixels (fewer basis functions) Visually unappealing Can only preserve edges coincident with pixel edges Results become even less invariant to translations Method of sieves (constrain image roughness) Condition number for pre-emission space can be even worse Lots of iterations Commutability condition rarely holds exactly in practice Degenerates to post-filtering in some cases Change cost function by adding a roughness penalty / prior Disadvantage: apparently subjective choice of penalty Apparent difficulty in choosing penalty parameters (cf. apodizing filter / cutoff frequency in FBP) 2.32

72 Penalty Function Considerations Computation Algorithm complexity Uniqueness of minimizer of Ψ(x) Resolution properties (edge preserving?) # of adjustable parameters Predictability of properties (resolution and noise) Choices separable vs nonseparable quadratic vs nonquadratic convex vs nonconvex 2.33

73 Penalty Functions: Separable vs Nonseparable
Separable penalty functions:
  Identity norm: R(x) = (1/2) x′Ix = Σ_{j=1}^{n_p} x_j²/2  (penalizes large values of x, but causes "squashing bias")
  Entropy: R(x) = Σ_{j=1}^{n_p} x_j log x_j
  Gaussian prior with mean µ_j, variance σ_j²: R(x) = Σ_{j=1}^{n_p} (x_j − µ_j)² / (2σ_j²)
  Gamma prior: R(x) = Σ_{j=1}^{n_p} p(x_j, µ_j, σ_j), where p(x, µ, σ) is a Gamma pdf
The first two basically keep pixel values from "blowing up." The last two encourage pixel values to be close to prior means µ_j.
General separable form: R(x) = Σ_{j=1}^{n_p} f_j(x_j)
Slightly simpler for minimization, but these do not explicitly enforce smoothness. The simplicity advantage has been overcome in newer algorithms. 2.34

74 Penalty Functions: Separable vs Nonseparable
Nonseparable (partially couple pixel values) to penalize roughness. Example (5 pixels x_1, ..., x_5):
  R(x) = (x_2 − x_1)² + (x_3 − x_2)² + (x_5 − x_4)² + (x_4 − x_1)² + (x_5 − x_2)²
[Figure: three example images with R(x) = 1, R(x) = 6, R(x) = 10.]
Rougher images ⇒ greater R(x) 2.35

75 Roughness Penalty Functions
First-order neighborhood and pairwise pixel differences:
  R(x) = Σ_{j=1}^{n_p} (1/2) Σ_{k ∈ N_j} ψ(x_j − x_k)
  N_j ≜ neighborhood of the jth pixel (e.g., left, right, up, down)
  ψ is called the potential function
Finite-difference approximation to a continuous roughness measure:
  R(f(·)) = ∫ ‖∇f(r)‖² dr = ∫ |∂f(r)/∂x|² + |∂f(r)/∂y|² + |∂f(r)/∂z|² dr.
Second derivatives also useful: (More choices!)
  ∂²f(r)/∂x² |_{r = r_j} ≈ f(r_{j+1}) − 2 f(r_j) + f(r_{j−1})
  R(x) = Σ_{j=1}^{n_p} ψ(x_{j+1} − 2x_j + x_{j−1}) + ... 2.36

76 Penalty Functions: General Form
  R(x) = Σ_k ψ_k([Cx]_k),  where [Cx]_k = Σ_{j=1}^{n_p} c_kj x_j
Example (5 pixels x_1, ..., x_5):
  Cx = [ -1  1  0  0  0
          0 -1  1  0  0
          0  0  0 -1  1
         -1  0  0  1  0
          0 -1  0  0  1 ] [x_1; x_2; x_3; x_4; x_5] = [x_2 − x_1; x_3 − x_2; x_5 − x_4; x_4 − x_1; x_5 − x_2]
  R(x) = Σ_{k=1}^5 ψ_k([Cx]_k) = ψ_1(x_2 − x_1) + ψ_2(x_3 − x_2) + ψ_3(x_5 − x_4) + ψ_4(x_4 − x_1) + ψ_5(x_5 − x_2) 2.37

77 Penalty Functions: Quadratic vs Nonquadratic
  R(x) = Σ_k ψ_k([Cx]_k)
Quadratic ψ_k
  If ψ_k(t) = t²/2, then R(x) = (1/2) x′C′Cx, a quadratic form.
  Simpler optimization; global smoothing
Nonquadratic ψ_k
  Edge preserving
  More complicated optimization. (This is essentially solved in the convex case.)
  Unusual noise properties
  Analysis/prediction of resolution and noise properties is difficult
  More adjustable parameters (e.g., δ)
Example: Huber function. ψ(t) ≜ t²/2 for |t| ≤ δ;  δ|t| − δ²/2 for |t| > δ
Example: Hyperbola function. ψ(t) ≜ δ² [√(1 + (t/δ)²) − 1] 2.38
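A minimal sketch of the general-form penalty R(x) = Σ_k ψ([Cx]_k) with quadratic and Huber potentials. The 1D first-difference matrix C, the value δ = 0.1, and the two test signals are illustrative assumptions, not the course's choices.

```python
import numpy as np

def huber(t, delta):
    """Huber potential: quadratic near 0, linear for |t| > delta."""
    a = np.abs(t)
    return np.where(a <= delta, t**2 / 2, delta * a - delta**2 / 2)

def roughness(x, C, psi=lambda t: t**2 / 2):
    """General-form penalty R(x) = sum_k psi([Cx]_k)."""
    return float(np.sum(psi(C @ x)))

# Illustrative 1D first-difference matrix C for a 6-pixel signal (an assumption)
n = 6
C = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)   # rows compute x_{j+1} - x_j
x_smooth = np.linspace(0, 1, n)
x_edge = np.array([0, 0, 0, 1, 1, 1.0])

print("quadratic R:", roughness(x_smooth, C), roughness(x_edge, C))
print("Huber R    :", roughness(x_smooth, C, lambda t: huber(t, 0.1)),
      roughness(x_edge, C, lambda t: huber(t, 0.1)))
```

The printout illustrates the edge-preservation point above: the quadratic potential charges the step edge heavily, while the Huber potential charges it much less.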

78 Quadratic vs Non quadratic Potential Functions Parabola (quadratic) Huber, δ=1 Hyperbola, δ=1 ψ(t) t Lower cost for large differences = edge preservation 2.39

79 Edge-Preserving Reconstruction Example Phantom Quadratic Penalty Huber Penalty 2.40

80 More Edge Preserving Regularization Chlewicki et al., PMB, Oct. 2004: Noise reduction and convergence of Bayesian algorithms with blobs based on the Huber function and median root prior 2.41

81 Piecewise Constant "Cartoon" Objects
[Figure: reconstructions from 400 k-space samples: x_true, x_"conj phase", x_pcg quad (quadratic penalty), and x_pcg edge (edge-preserving penalty).] 2.42

82 Total Variation Regularization
Non-quadratic roughness penalty:
  ∫ ‖∇f(r)‖ dr ≈ Σ_k |[Cx]_k|
Uses the magnitude instead of the squared magnitude of the gradient.
Problem: |·| is not differentiable.
Practical solution: approximate |t| by a hyperbola such as ψ(t) = δ² [√(1 + (t/δ)²) − 1].
[Figure: potential functions ψ(t) vs t: total variation, hyperbola with δ = 0.2, hyperbola with δ = 1.] 2.43

83 Penalty Functions: Convex vs Nonconvex
Convex
  Easier to optimize
  Guaranteed unique minimizer of Ψ (for convex negative log-likelihood)
Nonconvex
  Greater degree of edge preservation
  Nice images for piecewise-constant phantoms!
  Even more unusual noise properties
  Multiple extrema; more complicated optimization (simulated / deterministic annealing)
  Estimator x̂ becomes a discontinuous function of the data Y
Nonconvex examples
  "broken parabola": ψ(t) = min(t², t²_max)
  "true" median root prior: R(x) = Σ_{j=1}^{n_p} (x_j − median_j(x))² / median_j(x), where median_j(x) is a local median
Exception: orthonormal wavelet threshold denoising via nonconvex potentials! 2.44

84 Potential Function ψ(t) Potential Functions Parabola (quadratic) Huber (convex) Broken parabola (non convex) δ= t = x x j k 2.45

85 Local Extrema and Discontinuous Estimators Ψ(x) x Small change in data = large change in minimizer ˆx. Using convex penalty functions obviates this problem. ˆx 2.46

86 Augmented Regularization Functions
Replace roughness penalty R(x) with R(x|b) + αR(b), where the elements of b (often binary) indicate boundary locations.
  Line-site methods; level-set methods
Joint estimation problem:
  (x̂, b̂) = argmin_{x, b} Ψ(x, b),  Ψ(x, b) = L(x; y) + βR(x|b) + αR(b).
Example: b_jk indicates the presence of an edge between pixels j and k:
  R(x|b) = Σ_{j=1}^{n_p} Σ_{k ∈ N_j} (1 − b_jk) (1/2)(x_j − x_k)²
Penalty to discourage too many edges (e.g.): R(b) = Σ_{jk} b_jk.
  Can encourage local edge continuity
  May require annealing methods for minimization 2.47

87 Modified Penalty Functions
  R(x) = Σ_{j=1}^{n_p} (1/2) Σ_{k ∈ N_j} w_jk ψ(x_j − x_k)
Adjust the weights {w_jk} to
  Control resolution properties
  Incorporate anatomical side information (MR/CT) (avoid smoothing across anatomical boundaries)
Recommendations
  Emission tomography:
    Begin with quadratic (nonseparable) penalty functions
    Consider modified penalty for resolution control and choice of β
    Use modest regularization and post-filter more if desired
  Transmission tomography (attenuation maps), X-ray CT:
    consider convex nonquadratic (e.g., Huber) penalty functions
    choose δ based on attenuation map units (water, bone, etc.)
    choice of regularization parameter β remains nontrivial; learn appropriate values by experience for a given study type 2.48

88 Choice 4.3: Constraints Nonnegativity Known support Count preserving Upper bounds on values e.g., maximum µ of attenuation map in transmission case Considerations Algorithm complexity Computation Convergence rate Bias (in low-count regions)

89 Open Problems Modeling Noise in a i j s (system model errors) Noise in ˆr i s (estimates of scatter / randoms) Statistics of corrected measurements Statistics of measurements with deadtime losses For PL or MAP reconstruction, Qi (MIC 2004) has derived a bound on system model errors relative to data noise. Cost functions Performance prediction for nonquadratic penalties Effect of nonquadratic penalties on detection tasks Choice of regularization parameters for nonquadratic regularization 2.50

90 Summary 1. Object parameterization: function f( r) vs vector x 2. System physical model: s i ( r) 3. Measurement statistical model Y i? 4. Cost function: data-mismatch / regularization / constraints Reconstruction Method Cost Function + Algorithm Naming convention criterion - algorithm : ML-EM, MAP-OSL, PL-SAGE, PWLS+SOR, PWLS-CG,

91 Part 3. Algorithms Method = Cost Function + Algorithm Outline Ideal algorithm Classical general-purpose algorithms Considerations: nonnegativity parallelization convergence rate monotonicity Algorithms tailored to cost functions for imaging Optimization transfer EM-type methods Poisson emission problem Poisson transmission problem Ordered-subsets / block-iterative algorithms Recent convergent versions 3.1

92 Why iterative algorithms?
For nonquadratic Ψ, no closed-form solution for the minimizer.
For quadratic Ψ with nonnegativity constraints, no closed-form solution.
For quadratic Ψ without constraints, closed-form solutions:
  PWLS: x̂ = argmin_x ‖y − Ax‖²_{W^{1/2}} + x′Rx = [A′WA + R]^{−1} A′Wy
  OLS:  x̂ = argmin_x ‖y − Ax‖² = [A′A]^{−1} A′y
Impractical (memory and computation) for realistic problem sizes. A is sparse, but A′A is not.
All algorithms are imperfect. No single best solution. 3.2

93 General Iteration Projection Measurements Calibration... x (n) Iteration System Model Ψ x (n+1) Parameters Deterministic iterative mapping: x (n+1) = M (x (n) ) 3.3

94 Ideal Algorithm x argmin Ψ(x) x 0 (global minimizer) Properties stable and convergent {x (n) } converges to x if run indefinitely converges quickly {x (n) } gets close to x in just a few iterations globally convergent lim n x (n) independent of starting image x (0) fast requires minimal computation per iteration robust insensitive to finite numerical precision user friendly nothing to adjust (e.g., acceleration factors) parallelizable (when necessary) simple easy to program and debug flexible accommodates any type of system model (matrix stored by row or column, or factored, or projector/backprojector) Choices: forgo one or more of the above 3.4

95 Classic Algorithms
Non-gradient based
  Exhaustive search; Nelder-Mead simplex (amoeba)
  Converge very slowly, but work with nondifferentiable cost functions.
Gradient based
  Gradient descent: x^(n+1) ≜ x^(n) − α ∇Ψ(x^(n))
  Choosing α to ensure convergence is nontrivial.
  Steepest descent: x^(n+1) ≜ x^(n) − α_n ∇Ψ(x^(n)),  where α_n ≜ argmin_α Ψ(x^(n) − α ∇Ψ(x^(n)))
  Computing α_n can be expensive.
Limitations
  Converge slowly.
  Do not easily accommodate nonnegativity constraint. 3.5
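A minimal sketch of steepest descent for a quadratic weighted least-squares cost, where the line-search step α_n has a closed form; the function name, sizes, and toy data are assumptions, and no nonnegativity constraint or regularization is included.

```python
import numpy as np

def steepest_descent_wls(A, W, y, n_iter=50):
    """Steepest descent for the quadratic cost Psi(x) = 1/2 ||y - Ax||^2_W,
    using the exact line search available in the quadratic case.
    (Unconstrained sketch; real reconstructions also need regularization.)"""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (W @ (A @ x - y))            # gradient of Psi
        denom = grad @ (A.T @ (W @ (A @ grad)))   # curvature along -grad
        if denom <= 0:
            break
        alpha = (grad @ grad) / denom             # exact step for quadratic Psi
        x = x - alpha * grad
    return x

# Tiny usage example with made-up data (illustrative only)
rng = np.random.default_rng(2)
A = rng.random((30, 10))
x_true = rng.random(10)
y = A @ x_true + 0.01 * rng.standard_normal(30)
W = np.eye(30)
print(np.round(steepest_descent_wls(A, W, y, 200) - x_true, 3))
```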

96 Gradients & Nonnegativity - A Mixed Blessing
Unconstrained optimization of differentiable cost functions:
  ∇Ψ(x) = 0 when x = x*
  A necessary condition always. A sufficient condition for strictly convex cost functions.
  Iterations search for the zero of the gradient.
Nonnegativity-constrained minimization: Karush-Kuhn-Tucker conditions
  ∂Ψ(x)/∂x_j |_{x = x*} is = 0 if x_j* > 0, and ≥ 0 if x_j* = 0
  A necessary condition always. A sufficient condition for strictly convex cost functions.
  Iterations search for ???
  0 = x_j* (∂/∂x_j) Ψ(x*) is a necessary condition, but never a sufficient condition. 3.6

97 Karush-Kuhn-Tucker Illustrated Ψ(x) Inactive constraint Active constraint x 3.7

98 Why Not Clip Negatives? 3 2 WLS with Clipped Newton Raphson Nonnegative Orthant 1 x Newton-Raphson with negatives set to zero each iteration. Fixed-point of iteration is not the constrained minimizer! x 1 3.8

99 Newton-Raphson Algorithm
  x^(n+1) = x^(n) − [∇²Ψ(x^(n))]^{−1} ∇Ψ(x^(n))
Advantage: super-linear convergence rate (if convergent)
Disadvantages:
  Requires twice-differentiable Ψ
  Not guaranteed to converge
  Not guaranteed to monotonically decrease Ψ
  Does not enforce nonnegativity constraint
  Computing the Hessian ∇²Ψ often expensive
  Impractical for image recovery due to matrix inverse
General purpose remedy: bound-constrained quasi-Newton algorithms 3.9

100 Newton's Quadratic Approximation
2nd-order Taylor series:
  Ψ(x) ≈ φ(x; x^(n)) ≜ Ψ(x^(n)) + ∇Ψ(x^(n)) (x − x^(n)) + (1/2)(x − x^(n))′ ∇²Ψ(x^(n)) (x − x^(n))
Set x^(n+1) to the ("easily" found) minimizer of this quadratic approximation:
  x^(n+1) ≜ argmin_x φ(x; x^(n)) = x^(n) − [∇²Ψ(x^(n))]^{−1} ∇Ψ(x^(n))
Can be nonmonotone for the Poisson emission tomography log-likelihood, even for a single pixel and single ray:
  Ψ(x) = (x + r) − y log(x + r). 3.10

101 Nonmonotonicity of Newton-Raphson Ψ(x) 0.5 new 1 old 1.5 Log Likelihood Newton Parabola x 3.11

102 Consideration: Monotonicity
An algorithm is monotonic if Ψ(x^(n+1)) ≤ Ψ(x^(n)), ∀ x^(n).
Three categories of algorithms:
  Nonmonotonic (or unknown)
  Forced monotonic (e.g., by line search)
  Intrinsically monotonic (by design, simplest to implement)
Forced monotonicity
Most nonmonotonic algorithms can be converted to forced monotonic algorithms by adding a line-search step:
  x^temp ≜ M(x^(n)),  d^(n) = x^temp − x^(n)
  x^(n+1) ≜ x^(n) + α_n d^(n),  where α_n ≜ argmin_α Ψ(x^(n) + α d^(n))
Inconvenient, sometimes expensive, nonnegativity problematic. 3.12

103 Conjugate Gradient Algorithm Advantages: Fast converging (if suitably preconditioned) (in unconstrained case) Monotonic (forced by line search in nonquadratic case) Global convergence (unconstrained case) Flexible use of system matrix A and tricks Easy to implement in unconstrained quadratic case Highly parallelizable Disadvantages: Nonnegativity constraint awkward (slows convergence?) Line-search awkward in nonquadratic cases Highly recommended for unconstrained quadratic problems (e.g., PWLS without nonnegativity). Useful (but perhaps not ideal) for Poisson case too. 3.13

104 Consideration: Parallelization Simultaneous (fully parallelizable) update all pixels simultaneously using all data EM, Conjugate gradient, ISRA, OSL, SIRT, MART,... Block iterative (ordered subsets) update (nearly) all pixels using one subset of the data at a time OSEM, RBBI,... Row action update many pixels using a single ray at a time ART, RAMLA Pixel grouped (multiple column action) update some (but not all) pixels simultaneously a time, using all data Grouped coordinate descent, multi-pixel SAGE (Perhaps the most nontrivial to implement) Sequential (column action) update one pixel at a time, using all (relevant) data Coordinate descent, SAGE 3.14

105 Coordinate Descent Algorithm
aka Gauss-Seidel, successive over-relaxation (SOR), iterated conditional modes (ICM)
Update one pixel at a time, holding others fixed to their most recent values:
  x_j^new = argmin_{x_j ≥ 0} Ψ(x_1^new, ..., x_{j−1}^new, x_j, x_{j+1}^old, ..., x_{n_p}^old),  j = 1, ..., n_p
Advantages:
  Intrinsically monotonic
  Fast converging (from good initial image)
  Global convergence
  Nonnegativity constraint trivial
Disadvantages:
  Requires column access of system matrix A
  Cannot exploit some "tricks" for A, e.g., factorizations
  Expensive "arg min" for nonquadratic problems
  Poorly parallelizable 3.15
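A minimal sketch of nonnegativity-constrained coordinate descent for an unregularized least-squares cost, where each 1D update has a closed form that is clipped at zero exactly; the function name, residual bookkeeping, and toy data are assumptions for illustration.

```python
import numpy as np

def coord_descent_nnls(A, y, n_iter=20):
    """Nonnegative coordinate descent for Psi(x) = 1/2 ||y - Ax||^2.
    Each 1D update is exact for this quadratic cost (unlike clipping a
    full Newton step, cf. slide 98)."""
    n_p = A.shape[1]
    x = np.zeros(n_p)
    resid = y - A @ x                      # maintained residual y - Ax
    col_sq = np.sum(A**2, axis=0)          # ||a_j||^2 for each column
    for _ in range(n_iter):
        for j in range(n_p):
            aj = A[:, j]
            # unconstrained 1D minimizer, then project onto x_j >= 0
            xj_new = max(0.0, x[j] + aj @ resid / col_sq[j])
            resid += aj * (x[j] - xj_new)  # update residual for the change
            x[j] = xj_new
    return x

# Tiny usage example (illustrative data)
rng = np.random.default_rng(3)
A = rng.random((40, 8))
x_true = np.abs(rng.standard_normal(8))
y = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(coord_descent_nnls(A, y) - x_true, 3))
```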

106 Constrained Coordinate Descent Illustrated 2 Clipped Coordinate Descent Algorithm x x

107 Coordinate Descent - Unconstrained 2 Unconstrained Coordinate Descent Algorithm x x

108 Coordinate-Descent Algorithm Summary
Recommended when all of the following apply:
  quadratic or nearly-quadratic convex cost function
  nonnegativity constraint desired
  precomputed and stored system matrix A with column access
  parallelization not needed (standard workstation)
Cautions:
  Good initialization (e.g., properly scaled FBP) essential. (Uniform image or zero image cause slow initial convergence.)
  Must be programmed carefully to be efficient. (Standard Gauss-Seidel implementation is suboptimal.)
  Updates high frequencies fastest ⇒ poorly suited to unregularized case
Used daily in UM clinic for 2D SPECT / PWLS / nonuniform attenuation 3.18

109 Summary of General-Purpose Algorithms Gradient-based Fully parallelizable Inconvenient line-searches for nonquadratic cost functions Fast converging in unconstrained case Nonnegativity constraint inconvenient Coordinate-descent Very fast converging Nonnegativity constraint trivial Poorly parallelizable Requires precomputed/stored system matrix CD is well-suited to moderate-sized 2D problem (e.g., 2D PET), but poorly suited to large 2D problems (X-ray CT) and fully 3D problems Neither is ideal.... need special-purpose algorithms for image reconstruction! 3.19

110 Data-Mismatch Functions Revisited
For fast converging, intrinsically monotone algorithms, consider the form of Ψ.
WLS:
  Ł(x) = Σ_{i=1}^{n_d} (1/2) w_i (y_i − [Ax]_i)² = Σ_{i=1}^{n_d} h_i([Ax]_i),  where h_i(l) ≜ (1/2) w_i (y_i − l)².
Emission Poisson (negative) log-likelihood:
  Ł(x) = Σ_{i=1}^{n_d} ([Ax]_i + r_i) − y_i log([Ax]_i + r_i) = Σ_{i=1}^{n_d} h_i([Ax]_i),  where h_i(l) ≜ (l + r_i) − y_i log(l + r_i).
Transmission Poisson log-likelihood:
  Ł(x) = Σ_{i=1}^{n_d} (b_i e^{−[Ax]_i} + r_i) − y_i log(b_i e^{−[Ax]_i} + r_i) = Σ_{i=1}^{n_d} h_i([Ax]_i),  where h_i(l) ≜ (b_i e^{−l} + r_i) − y_i log(b_i e^{−l} + r_i).
MRI, polyenergetic X-ray CT, confocal microscopy, image restoration, ...
All have the same partially separable form. 3.20

111 General Imaging Cost Function
General form for the data-mismatch function: Ł(x) = Σ_{i=1}^{n_d} h_i([Ax]_i)
General form for the regularizing penalty function: R(x) = Σ_k ψ_k([Cx]_k)
General form for the cost function:
  Ψ(x) = Ł(x) + βR(x) = Σ_{i=1}^{n_d} h_i([Ax]_i) + β Σ_k ψ_k([Cx]_k)
Properties of Ψ we can exploit:
  summation form (due to independence of measurements)
  convexity of the h_i functions (usually)
  summation argument (inner product of x with the ith row of A)
Most methods that use these properties are forms of optimization transfer. 3.21

112 Optimization Transfer Illustrated Ψ(x) and φ (n) (x) Surrogate function Cost function x (n) x (n+1) 3.22

113 Optimization Transfer
General iteration:
  x^(n+1) = argmin_{x ≥ 0} φ(x; x^(n))
Monotonicity conditions (cost function Ψ decreases provided these hold):
  φ(x^(n); x^(n)) = Ψ(x^(n))  (matched current value)
  ∇_x φ(x; x^(n)) |_{x = x^(n)} = ∇Ψ(x) |_{x = x^(n)}  (matched gradient)
  φ(x; x^(n)) ≥ Ψ(x) ∀ x ≥ 0  (lies above)
These 3 (sufficient) conditions are satisfied by the Q function of the EM algorithm (and SAGE).
The 3rd condition is not satisfied by the Newton-Raphson quadratic approximation, which leads to its nonmonotonicity. 3.23

114 Optimization Transfer in 2d 3.24

115 Optimization Transfer cf EM Algorithm E-step: choose surrogate function φ(x; x (n) ) M-step: minimize surrogate function Designing surrogate functions Easy to compute Easy to minimize Fast convergence rate x (n+1) = argmin φ ( x; x (n)) x 0 Often mutually incompatible goals... compromises 3.25

116 Convergence Rate: Slow High Curvature Small Steps Slow Convergence φ Φ Old New x 3.26

117 Convergence Rate: Fast Low Curvature Large Steps Fast Convergence φ Φ Old New x 3.27

118 Tool: Convexity Inequality
[Figure: a convex function g and the chord between x_1 and x_2.]
g convex ⇒ g(αx_1 + (1 − α)x_2) ≤ αg(x_1) + (1 − α)g(x_2) for α ∈ [0, 1]
More generally: α_k ≥ 0 and Σ_k α_k = 1 ⇒ g(Σ_k α_k x_k) ≤ Σ_k α_k g(x_k). Sum outside! 3.28

119 Example 1: Classical ML-EM Algorithm
Negative Poisson log-likelihood cost function (unregularized):
  Ψ(x) = Σ_{i=1}^{n_d} h_i([Ax]_i),  h_i(l) = (l + r_i) − y_i log(l + r_i).
Intractable to minimize directly due to the summation within the logarithm.
Clever trick due to De Pierro (let ȳ_i^(n) = [Ax^(n)]_i + r_i):
  [Ax]_i = Σ_{j=1}^{n_p} a_ij x_j = Σ_{j=1}^{n_p} [a_ij x_j^(n) / ȳ_i^(n)] (x_j / x_j^(n)) ȳ_i^(n).
Since the h_i's are convex in the Poisson emission model:
  h_i([Ax]_i) = h_i( Σ_{j=1}^{n_p} [a_ij x_j^(n) / ȳ_i^(n)] (x_j / x_j^(n)) ȳ_i^(n) ) ≤ Σ_{j=1}^{n_p} [a_ij x_j^(n) / ȳ_i^(n)] h_i( (x_j / x_j^(n)) ȳ_i^(n) )
  Ψ(x) = Σ_{i=1}^{n_d} h_i([Ax]_i) ≤ φ(x; x^(n)) ≜ Σ_{i=1}^{n_d} Σ_{j=1}^{n_p} [a_ij x_j^(n) / ȳ_i^(n)] h_i( (x_j / x_j^(n)) ȳ_i^(n) )
Replace the convex cost function Ψ(x) with the separable surrogate function φ(x; x^(n)). 3.29

120 ML-EM Algorithm M-step
The E-step gave a separable surrogate function:
  φ(x; x^(n)) = Σ_{j=1}^{n_p} φ_j(x_j; x^(n)),  where φ_j(x_j; x^(n)) ≜ Σ_{i=1}^{n_d} [a_ij x_j^(n) / ȳ_i^(n)] h_i( (x_j / x_j^(n)) ȳ_i^(n) ).
The M-step separates:
  x^(n+1) = argmin_{x ≥ 0} φ(x; x^(n))  ⇒  x_j^(n+1) = argmin_{x_j ≥ 0} φ_j(x_j; x^(n)),  j = 1, ..., n_p
Minimizing:
  (∂/∂x_j) φ_j(x_j; x^(n)) = Σ_{i=1}^{n_d} a_ij ḣ_i( (x_j / x_j^(n)) ȳ_i^(n) ) = Σ_{i=1}^{n_d} a_ij [1 − y_i x_j^(n) / (x_j ȳ_i^(n))] |_{x_j = x_j^(n+1)} = 0.
Solving (in the case r_i = 0):
  x_j^(n+1) = x_j^(n) [ Σ_{i=1}^{n_d} a_ij y_i / [Ax^(n)]_i ] / ( Σ_{i=1}^{n_d} a_ij ),  j = 1, ..., n_p
Derived without any statistical considerations, unlike the classical EM formulation.
Uses only convexity and algebra. Guaranteed monotonic: the surrogate φ satisfies the 3 required properties.
M-step trivial due to the separable surrogate. 3.30
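A minimal NumPy sketch of the resulting multiplicative ML-EM update x_j^(n+1) = x_j^(n) (Σ_i a_ij y_i/ȳ_i^(n)) / (Σ_i a_ij); the function name, the toy system matrix, and the data generation are illustrative assumptions.

```python
import numpy as np

def ml_em(A, y, r, n_iter=50):
    """Classical ML-EM iterations for y_i ~ Poisson{[Ax]_i + r_i}.
    The multiplicative update preserves nonnegativity and monotonically
    increases the Poisson likelihood."""
    x = np.ones(A.shape[1])             # strictly positive start
    sens = A.sum(axis=0)                # "sensitivity" image: sum_i a_ij
    for _ in range(n_iter):
        ybar = A @ x + r                # predicted means
        x *= (A.T @ (y / ybar)) / sens  # EM update
    return x

# Tiny illustrative example
rng = np.random.default_rng(6)
A = rng.random((60, 8)) * (rng.random((60, 8)) < 0.4)
x_true = rng.gamma(2.0, 1.0, 8)
r = np.full(60, 0.1)
y = rng.poisson(A @ x_true + r)
print(np.round(ml_em(A, y, r, 200), 2))
```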

121 ML-EM is Scaled Gradient Descent
  x_j^(n+1) = x_j^(n) [ Σ_{i=1}^{n_d} a_ij y_i / ȳ_i^(n) ] / ( Σ_{i=1}^{n_d} a_ij )
            = x_j^(n) + x_j^(n) [ Σ_{i=1}^{n_d} a_ij ( y_i / ȳ_i^(n) − 1 ) ] / ( Σ_{i=1}^{n_d} a_ij )
            = x_j^(n) − ( x_j^(n) / Σ_{i=1}^{n_d} a_ij ) (∂/∂x_j) Ψ(x^(n)),  j = 1, ..., n_p
  x^(n+1) = x^(n) − D(x^(n)) ∇Ψ(x^(n)),  D(x) = diag{ x_j / Σ_{i=1}^{n_d} a_ij }
This particular diagonal scaling matrix remarkably ensures monotonicity and ensures nonnegativity. 3.31

122 Consideration: Separable vs Nonseparable 2 Separable 2 Nonseparable 1 1 x2 0 x x x 1 Contour plots: loci of equal function values. Uncoupled vs coupled minimization. 3.32

123 Separable Surrogate Functions (Easy M-step)
The preceding EM derivation structure applies to any cost function of the form
  Ψ(x) = Σ_{i=1}^{n_d} h_i([Ax]_i).
cf. ISRA (for nonnegative LS), "convex algorithm" for transmission reconstruction
The derivation yields a separable surrogate function
  Ψ(x) ≤ φ(x; x^(n)),  where φ(x; x^(n)) = Σ_{j=1}^{n_p} φ_j(x_j; x^(n))
The M-step separates into 1D minimization problems (fully parallelizable):
  x^(n+1) = argmin_{x ≥ 0} φ(x; x^(n))  ⇒  x_j^(n+1) = argmin_{x_j ≥ 0} φ_j(x_j; x^(n)),  j = 1, ..., n_p
Why do EM / ISRA / convex-algorithm / etc. converge so slowly? 3.33

124 Separable vs Nonseparable Ψ Ψ φ φ Separable Nonseparable Separable surrogates (e.g., EM) have high curvature... slow convergence. Nonseparable surrogates can have lower curvature... faster convergence. Harder to minimize? Use paraboloids (quadratic surrogates). 3.34

125 High Curvature of EM Surrogate 2 Log Likelihood EM Surrogates hi(l) and Q(l;l n ) l 3.35

126 1D Parabola Surrogate Function
Find a parabola q_i^(n)(l) of the form:
  q_i^(n)(l) = h_i(l_i^(n)) + ḣ_i(l_i^(n)) (l − l_i^(n)) + c_i^(n) (1/2)(l − l_i^(n))²,  where l_i^(n) ≜ [Ax^(n)]_i
Satisfies the tangent condition. Choose the curvature to ensure the "lies above" condition:
  c_i^(n) ≜ min{ c ≥ 0 : q_i^(n)(l) ≥ h_i(l), ∀ l ≥ 0 }.
[Figure: surrogate functions for emission Poisson: negative log-likelihood, parabola surrogate, and EM surrogate versus l.]
Lower curvature! 3.36

127 Paraboloidal Surrogate
Combining the 1D parabola surrogates yields the paraboloidal surrogate:
  Ψ(x) = Σ_{i=1}^{n_d} h_i([Ax]_i) ≤ φ(x; x^(n)) = Σ_{i=1}^{n_d} q_i^(n)([Ax]_i)
Rewriting:
  φ(δ + x^(n); x^(n)) = Ψ(x^(n)) + ∇Ψ(x^(n)) δ + (1/2) δ′ A′ diag{c_i^(n)} A δ
Advantages
  The surrogate φ(x; x^(n)) is quadratic, unlike the Poisson log-likelihood ⇒ easier to minimize
  Not separable (unlike EM surrogate); not self-similar (unlike EM surrogate)
  Small curvatures ⇒ fast convergence
  Intrinsically monotone global convergence
  Fairly simple to derive / implement
Quadratic minimization options:
  Coordinate descent: + fast converging, + nonnegativity easy, − precomputed column-stored system matrix
  Gradient-based quadratic minimization methods: − nonnegativity inconvenient 3.37

128 Example: PSCD for PET Transmission Scans square-pixel basis strip-integral system model shifted-poisson statistical model edge-preserving convex regularization (Huber) nonnegativity constraint inscribed circle support constraint paraboloidal surrogate coordinate descent (PSCD) algorithm 3.38

129 Separable Paraboloidal Surrogate
To derive a parallelizable algorithm apply another De Pierro trick:
  [Ax]_i = Σ_{j=1}^{n_p} π_ij [ (a_ij / π_ij)(x_j − x_j^(n)) + l_i^(n) ],  l_i^(n) = [Ax^(n)]_i.
Provided π_ij ≥ 0 and Σ_{j=1}^{n_p} π_ij = 1, since the parabola q_i is convex:
  q_i^(n)([Ax]_i) = q_i^(n)( Σ_{j=1}^{n_p} π_ij [ (a_ij / π_ij)(x_j − x_j^(n)) + l_i^(n) ] ) ≤ Σ_{j=1}^{n_p} π_ij q_i^(n)( (a_ij / π_ij)(x_j − x_j^(n)) + l_i^(n) )
  ... φ(x; x^(n)) = Σ_{i=1}^{n_d} q_i^(n)([Ax]_i) ≤ φ̂(x; x^(n))
Separable Paraboloidal Surrogate:
  φ̂(x; x^(n)) = Σ_{j=1}^{n_p} φ_j(x_j; x^(n)),  φ_j(x_j; x^(n)) ≜ Σ_{i=1}^{n_d} π_ij q_i^(n)( (a_ij / π_ij)(x_j − x_j^(n)) + l_i^(n) )
Parallelizable M-step (cf. gradient descent!):
  x_j^(n+1) = argmin_{x_j ≥ 0} φ_j(x_j; x^(n)) = [ x_j^(n) − (1/d_j^(n)) (∂/∂x_j) Ψ(x^(n)) ]_+,  d_j^(n) = Σ_{i=1}^{n_d} (a_ij² / π_ij) c_i^(n)
Natural choice is π_ij = a_ij / a_i, with a_i ≜ Σ_{j=1}^{n_p} a_ij, giving d_j^(n) = Σ_{i=1}^{n_d} a_ij a_i c_i^(n). 3.39

130 Example: Poisson ML Transmission Problem
Transmission negative log-likelihood (for the ith ray): h_i(l) = (b_i e^{−l} + r_i) − y_i log(b_i e^{−l} + r_i).
Optimal (smallest) parabola surrogate curvature (Erdoğan, T-MI, Sep. 1999):
  c_i^(n) = c(l_i^(n), h_i),  c(l, h) = [ 2 (h(0) − h(l) + ḣ(l) l) / l² ]_+ for l > 0;  [ ḧ(l) ]_+ for l = 0.
Separable Paraboloidal Surrogate (SPS) Algorithm:
  Precompute a_i = Σ_{j=1}^{n_p} a_ij,  i = 1, ..., n_d
  Each iteration:
    l_i^(n) = [Ax^(n)]_i  (forward projection)
    ȳ_i^(n) = b_i e^{−l_i^(n)} + r_i  (predicted means)
    ḣ_i^(n) = (y_i / ȳ_i^(n) − 1)(ȳ_i^(n) − r_i)  (slopes)
    c_i^(n) = c(l_i^(n), h_i)  (curvatures)
    x_j^(n+1) = [ x_j^(n) − ( Σ_{i=1}^{n_d} a_ij ḣ_i^(n) ) / ( Σ_{i=1}^{n_d} a_ij a_i c_i^(n) ) ]_+,  j = 1, ..., n_p
Monotonically decreases the cost function each iteration. No logarithm! 3.40
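A compact NumPy sketch of one flavor of this transmission SPS update using the optimal-curvature formula above. The toy blank-scan values b_i, background r_i, the small dense A, and the numerical safeguards are illustrative assumptions, not the course's implementation.

```python
import numpy as np

def sps_transmission(A, y, b, r, n_iter=200):
    """Separable paraboloidal surrogate (SPS) iterations for the Poisson
    transmission model y_i ~ Poisson{b_i exp(-[Ax]_i) + r_i} (a sketch)."""
    a_row = A.sum(axis=1)                       # a_i = sum_j a_ij
    x = np.zeros(A.shape[1])
    h = lambda t: (b*np.exp(-t) + r) - y*np.log(b*np.exp(-t) + r)
    hddot0 = b * (1.0 - y * r / (b + r)**2)     # d^2 h / dl^2 at l = 0
    for _ in range(n_iter):
        l = A @ x                               # forward projection l_i
        ybar = b * np.exp(-l) + r               # predicted means
        hdot = (y / ybar - 1.0) * (ybar - r)    # dh_i/dl at l_i
        with np.errstate(divide="ignore", invalid="ignore"):
            c = np.where(l > 0, 2*(h(0*l) - h(l) + hdot*l)/l**2, hddot0)
        c = np.maximum(c, 1e-12)                # curvature floor ([.]_+)
        denom = A.T @ (a_row * c)               # d_j = sum_i a_ij a_i c_i
        x = np.maximum(x - (A.T @ hdot) / denom, 0.0)
    return x

# Tiny illustrative example
rng = np.random.default_rng(4)
A = rng.random((50, 6)) * 0.2
mu = np.array([0.1, 0.2, 0.05, 0.15, 0.1, 0.3])
b = np.full(50, 1e4); r = np.full(50, 5.0)
y = rng.poisson(b * np.exp(-A @ mu) + r)
print(np.round(sps_transmission(A, y, b, r), 3))
```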

131 The MAP-EM M-step "Problem"
Add a penalty function to our surrogate for the negative log-likelihood:
  Ψ(x) = Ł(x) + βR(x)
  φ(x; x^(n)) = Σ_{j=1}^{n_p} φ_j(x_j; x^(n)) + βR(x)
M-step: x^(n+1) = argmin_{x ≥ 0} φ(x; x^(n)) = argmin_{x ≥ 0} Σ_{j=1}^{n_p} φ_j(x_j; x^(n)) + βR(x) = ?
For nonseparable penalty functions, the M-step is coupled ... difficult.
Suboptimal solutions
  Generalized EM (GEM) algorithm (coordinate descent on φ). Monotonic, but inherits the slow convergence of EM.
  One-step-late (OSL) algorithm (use outdated gradients) (Green, T-MI, 1990):
    (∂/∂x_j) φ(x; x^(n)) = (∂/∂x_j) φ_j(x_j; x^(n)) + β (∂/∂x_j) R(x) ≈ (∂/∂x_j) φ_j(x_j; x^(n)) + β (∂/∂x_j) R(x^(n))
    Nonmonotonic. Known to diverge, depending on β. Temptingly simple, but avoid!
Contemporary solution
  Use a separable surrogate for the penalty function too (De Pierro, T-MI, Dec. 1995)
  Ensures monotonicity. Obviates all reasons for using OSL! 3.41

132 De Pierro's MAP-EM Algorithm
Apply separable paraboloidal surrogates to the penalty function:
  R(x) ≤ R_SPS(x; x^(n)) = Σ_{j=1}^{n_p} R_j(x_j; x^(n))
Overall separable surrogate: φ(x; x^(n)) = Σ_{j=1}^{n_p} φ_j(x_j; x^(n)) + β Σ_{j=1}^{n_p} R_j(x_j; x^(n))
The M-step becomes fully parallelizable:
  x_j^(n+1) = argmin_{x_j ≥ 0} φ_j(x_j; x^(n)) + βR_j(x_j; x^(n)),  j = 1, ..., n_p.
Consider a quadratic penalty R(x) = Σ_k ψ([Cx]_k), where ψ(t) = t²/2. If γ_kj ≥ 0 and Σ_{j=1}^{n_p} γ_kj = 1 then
  [Cx]_k = Σ_{j=1}^{n_p} γ_kj [ (c_kj / γ_kj)(x_j − x_j^(n)) + [Cx^(n)]_k ].
Since ψ is convex:
  ψ([Cx]_k) = ψ( Σ_{j=1}^{n_p} γ_kj [ (c_kj / γ_kj)(x_j − x_j^(n)) + [Cx^(n)]_k ] ) ≤ Σ_{j=1}^{n_p} γ_kj ψ( (c_kj / γ_kj)(x_j − x_j^(n)) + [Cx^(n)]_k ) 3.42

133 De Pierro's Algorithm Continued
So R(x) ≤ R̄(x; x^(n)) ≜ Σ_{j=1}^{n_p} R_j(x_j; x^(n)), where
  R_j(x_j; x^(n)) ≜ Σ_k γ_kj ψ( (c_kj / γ_kj)(x_j − x_j^(n)) + [Cx^(n)]_k )
M-step: minimizing φ_j(x_j; x^(n)) + βR_j(x_j; x^(n)) yields the iteration:
  x_j^(n+1) = [ −B_j + √( B_j² + ( β Σ_k c_kj² / γ_kj ) x_j^(n) Σ_{i=1}^{n_d} a_ij y_i / ȳ_i^(n) ) ] / ( β Σ_k c_kj² / γ_kj ),  j = 1, ..., n_p
where B_j ≜ (1/2) [ Σ_{i=1}^{n_d} a_ij + β Σ_k ( c_kj [Cx^(n)]_k − (c_kj² / γ_kj) x_j^(n) ) ]  and ȳ_i^(n) = [Ax^(n)]_i + r_i.
Advantages: intrinsically monotone, nonnegativity, fully parallelizable. Requires only a couple % more computation per iteration than ML-EM.
Disadvantages: slow convergence (like EM) due to the separable surrogate. 3.43

134 Ordered Subsets Algorithms
aka block iterative or incremental gradient algorithms
The gradient appears in essentially every algorithm:
  Ł(x) = Σ_{i=1}^{n_d} h_i([Ax]_i)  ⇒  (∂/∂x_j) Ł(x) = Σ_{i=1}^{n_d} a_ij ḣ_i([Ax]_i).
This is a backprojection of a sinogram of the derivatives { ḣ_i([Ax]_i) }.
Intuition: with half the angular sampling, this backprojection would be fairly similar:
  (1/n_d) Σ_{i=1}^{n_d} a_ij ḣ_i(·) ≈ (1/|S|) Σ_{i ∈ S} a_ij ḣ_i(·),
where S is a subset of the rays.
To "OS-ize" an algorithm, replace all backprojections with partial sums.
Recall the typical iteration: x^(n+1) = x^(n) − D(x^(n)) ∇Ψ(x^(n)). 3.44
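A minimal OS-EM sketch built on the ML-EM update of slide 120, cycling over interleaved subsets of rays and using the partial backprojections in place of the full ones. The subset partition, sizes, data, and the small denominator guard are assumptions for illustration.

```python
import numpy as np

def os_em(A, y, r, n_subsets=4, n_iter=25):
    """OS-EM sketch: apply the multiplicative EM update using one subset
    of rays at a time (no relaxation, so it may cycle rather than converge)."""
    x = np.ones(A.shape[1])
    subsets = [np.arange(m, A.shape[0], n_subsets) for m in range(n_subsets)]
    for _ in range(n_iter):
        for idx in subsets:
            As, ys, rs = A[idx], y[idx], r[idx]
            ybar = As @ x + rs
            sens = np.maximum(As.sum(axis=0), 1e-12)  # subset sensitivity
            x = x * (As.T @ (ys / ybar)) / sens       # subset-scaled EM update
    return x

# Tiny illustrative example with made-up data
rng = np.random.default_rng(5)
A = rng.random((80, 10)) * (rng.random((80, 10)) < 0.5)
x_true = rng.gamma(2.0, 1.0, 10)
r = np.full(80, 0.2)
y = rng.poisson(A @ x_true + r)
print(np.round(os_em(A, y, r), 2))
```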

135 Geometric View of Ordered Subsets n x f x 1 ( k ) Φ ( x k ) f2( x k ) f1( x n ) argmax f 1( x ) k x Φ ( x n ) f2( x n ) * x argmax f 2( x ) Two subset case: Ψ(x) = f 1 (x)+ f 2 (x) (e.g., odd and even projection views). For x (n) far from x, even partial gradients should point roughly towards x. For x (n) near x, however, Ψ(x) 0, so f 1 (x) f 2 (x) = cycles! Issues. Subset gradient balance : Ψ(x) M f k (x). Choice of ordering. 3.45

136 Incremental Gradients (WLS, 2 Subsets) f WLS (x) 10 2 f even (x) 10 2 f odd (x) 10 x true x 0 1 difference 5 difference 5 M=2 0 5 (full subset)

137 Subset Gradient Imbalance f WLS (x) 10 2 f 0 90 (x) 10 2 f (x) x 0 1 difference 5 difference 5 M=2 0 5 (full subset)

138 Non-monotone Does not converge (may cycle) Problems with OS-EM Byrne s rescaled block iterative (RBI) approach converges only for consistent (noiseless) data... unpredictable What resolution after n iterations? Object-dependent, spatially nonuniform What variance after n iterations? ROI variance? (e.g., for Huesman s WLS kinetics) 3.48

139 OSEM vs Penalized Likelihood image sinogram 10 6 counts 15% randoms/scatter uniform attenuation contrast in cold region within-region σ opposite side 3.49

140 Contrast-Noise Results OSEM 1 subset OSEM 4 subset OSEM 16 subset PL PSCA Noise (64 angles) Uniform image Contrast 3.50

141 1.5 Horizontal Profile OSEM 4 subsets, 5 iterations PL PSCA 10 iterations Relative Activity x

142 Making OS Methods Converge
Two routes: relaxation; incrementalism.
Relaxed block-iterative methods: Ψ(x) = Σ_{m=1}^M Ψ_m(x)
  x^(n+(m+1)/M) = x^(n+m/M) − α_n D(x^(n+m/M)) ∇Ψ_m(x^(n+m/M)),  m = 0, ..., M − 1
Relaxation of step sizes: α_n → 0 as n → ∞,  Σ_n α_n = ∞,  Σ_n α_n² < ∞
  ART; RAMLA, BSREM (De Pierro, T-MI, 1997, 2001); Ahn and Fessler, NSS/MIC 2001, T-MI 2003
Considerations
  Proper relaxation can induce convergence, but still lacks monotonicity.
  Choice of relaxation schedule requires experimentation.
  Ψ_m(x) = Ł_m(x) + (1/M) R(x), so each Ψ_m includes part of the likelihood yet all of R. 3.52

143 Relaxed OS-SPS x Penalized likelihood increase Original OS SPS Modified BSREM Relaxed OS SPS Iteration 3.53

144 Incremental Methods
Incremental EM applied to emission tomography by Hsiao et al. as C-OSEM
Incremental optimization transfer (Ahn & Fessler, MIC 2004)
Find a majorizing surrogate for each sub-objective function:
  φ_m(x̄; x̄) = Ψ_m(x̄),  φ_m(x; x̄) ≥ Ψ_m(x),  ∀ x, x̄
Define the following augmented cost function:
  F(x; x̄_1, ..., x̄_M) = Σ_{m=1}^M φ_m(x; x̄_m).
Fact: by construction, x̂ = argmin_x Ψ(x) = argmin_x min_{x̄_1, ..., x̄_M} F(x; x̄_1, ..., x̄_M).
Alternating minimization: for m = 1, ..., M:
  x^new = argmin_x F(x; x̄_1^(n+1), ..., x̄_{m−1}^(n+1), x̄_m^(n), x̄_{m+1}^(n), ..., x̄_M^(n))
  x̄_m^(n+1) = argmin_{x̄_m} F(x^new; x̄_1^(n+1), ..., x̄_{m−1}^(n+1), x̄_m, x̄_{m+1}^(n), ..., x̄_M^(n)) = x^new.
Use all current information, but increment the surrogate for only one subset.
Monotone in F, converges under reasonable assumptions on Ψ.
In contrast, OS-EM uses the information from only one subset at a time. 3.54

145 TRIOT Example: Convergence Rate Transmission incremental optimization transfer (TRIOT) normalized Φ difference subsets, initialized with uniform image SPS MC SPS PC TRIOT MC TRIOT PC OS SPS 2 iterations of OS SPS included iteration 3.55

146 TRIOT Example: Attenuation Map Images FBP PL optimal image OS-SPS TRIOT-PC OS-SPS: 64 subsets, 20 iterations, one point of the limit cycle TRIOT-PC: 64 subsets, 20 iterations, after 2 iterations of OS-SPS) 3.56

147 OSTR aka Transmission OS-SPS Ordered subsets version of separable paraboloidal surrogates for PET transmission problem with nonquadratic convex regularization Matlab m-file fessler /code/transmission/tpl osps.m 3.57

148 Precomputed curvatures for OS-SPS
Separable Paraboloidal Surrogate (SPS) Algorithm:
  x_j^(n+1) = [ x_j^(n) − ( Σ_{i=1}^{n_d} a_ij ḣ_i([Ax^(n)]_i) ) / ( Σ_{i=1}^{n_d} a_ij a_i c_i^(n) ) ]_+,  j = 1, ..., n_p
Ordered subsets abandons monotonicity, so why use the optimal curvatures c_i^(n)?
Precomputed curvature: c_i = ḧ_i(l̂_i),  l̂_i = argmin_l h_i(l)
Precomputed denominator (saves one backprojection each iteration!):
  d_j = Σ_{i=1}^{n_d} a_ij a_i c_i,  j = 1, ..., n_p.
OS-SPS algorithm with M subsets:
  x_j^(n+1) = [ x_j^(n) − ( Σ_{i ∈ S(n)} a_ij ḣ_i([Ax^(n)]_i) ) / ( d_j / M ) ]_+,  j = 1, ..., n_p 3.58

149 Summary of Algorithms General-purpose optimization algorithms Optimization transfer for image reconstruction algorithms Separable surrogates = high curvatures = slow convergence Ordered subsets accelerate initial convergence require relaxation or incrementalism for true convergence Principles apply to emission and transmission reconstruction Still work to be done... An Open Problem Still no algorithm with all of the following properties: Nonnegativity easy Fast converging Intrinsically monotone global convergence Accepts any type of system matrix Parallelizable 3.59

150 Part 4. Performance Characteristics Spatial resolution properties Noise properties Detection properties 4.1

151 Spatial Resolution Properties
Choosing β can be painful, so...
For true minimization methods, $\hat{x} = \arg\min_x \Psi(x)$, the local impulse response is approximately (Fessler and Rogers, T-MI, 1996):
$l_j(x) = \lim_{\delta \to 0} \frac{E[\hat{x} \mid x + \delta e_j] - E[\hat{x} \mid x]}{\delta} \approx \left[ \nabla^{20} \Psi \right]^{-1} \nabla^{11} \Psi \; \frac{\partial}{\partial x_j} \bar{y}(x)$.
Depends only on the chosen cost function and statistical model.
Independent of optimization algorithm (if iterated to convergence).
Enables prediction of resolution properties (provided Ψ is minimized).
Useful for designing regularization penalty functions with desired resolution properties.
For penalized likelihood: $l_j(x^{\text{true}}) \approx [A'WA + \beta R]^{-1} A'WA\, e_j$. Helps choose β for a desired spatial resolution. 4.2
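For a small problem with explicit matrices, the penalized-likelihood local impulse response above can be evaluated directly; the following NumPy sketch (matrix names and the β sweep are illustrative assumptions, and large problems would use circulant/FFT approximations or iterative solves instead) shows how β could be tuned against a target local PSF.

```python
import numpy as np

def local_impulse_response(A, W, R, beta, j):
    """Predicted local impulse response l_j = [A'WA + beta*R]^{-1} A'WA e_j.

    A : (n_d, n_p) system matrix,  W : (n_d, n_d) weighting matrix,
    R : (n_p, n_p) penalty Hessian, j : index of the impulse pixel.
    """
    F = A.T @ W @ A                       # Fisher-information-like term A'WA
    e_j = np.zeros(A.shape[1])
    e_j[j] = 1.0
    return np.linalg.solve(F + beta * R, F @ e_j)

# Possible usage: sweep beta and inspect the width of the predicted PSF
# for beta in (1.0, 10.0, 100.0):
#     psf = local_impulse_response(A, W, R, beta, j=center_pixel)
```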

152 Modified Penalty Example, PET [Panels: a) filtered backprojection; b) penalized unweighted least-squares; c) PWLS with conventional regularization; d) PWLS with certainty-based penalty [36]; e) PWLS with modified penalty [183].] 4.3

153 Modified Penalty Example, SPECT - Noiseless Target filtered object FBP Conventional PWLS Truncated EM Post-filtered EM Modified Regularization 4.4

154 Modified Penalty Example, SPECT - Noisy Target filtered object FBP Conventional PWLS Truncated EM Post-filtered EM Modified Regularization 4.5

155 Regularized vs Post-filtered, with Matched PSF [Plot: noise comparisons at the center pixel; pixel standard deviation (%) vs. target image resolution (mm) for uniformity-corrected FBP, penalized likelihood, and post-smoothed ML.] 4.6

156 Reconstruction Noise Properties
For unconstrained (converged) minimization methods, the estimator is implicit:
$\hat{x} = \hat{x}(y) = \arg\min_x \Psi(x, y)$. What is $\mathrm{Cov}\{\hat{x}\}$?
New simpler derivation. Denote the column gradient by $g(x, y) \triangleq \nabla_x \Psi(x, y)$.
Ignoring constraints, the gradient is zero at the minimizer: $g(\hat{x}(y), y) = 0$.
First-order Taylor series expansion:
$g(\hat{x}, y) \approx g(x^{\text{true}}, y) + \nabla_x g(x^{\text{true}}, y)\,(\hat{x} - x^{\text{true}}) = g(x^{\text{true}}, y) + \nabla^2_x \Psi(x^{\text{true}}, y)\,(\hat{x} - x^{\text{true}})$.
Equating to zero:
$\hat{x} - x^{\text{true}} \approx -\left[ \nabla^2_x \Psi(x^{\text{true}}, y) \right]^{-1} \nabla_x \Psi(x^{\text{true}}, y)$.
If the Hessian $\nabla^2 \Psi$ is weakly dependent on $y$, then
$\mathrm{Cov}\{\hat{x}\} \approx \left[ \nabla^2_x \Psi(x^{\text{true}}, \bar{y}) \right]^{-1} \mathrm{Cov}\{\nabla_x \Psi(x^{\text{true}}, y)\} \left[ \nabla^2_x \Psi(x^{\text{true}}, \bar{y}) \right]^{-1}$.
If we further linearize w.r.t. the data, $g(x, y) \approx g(x, \bar{y}) + \nabla_y g(x, \bar{y})\,(y - \bar{y})$, then
$\mathrm{Cov}\{\hat{x}\} \approx \left[ \nabla^2_x \Psi \right]^{-1} (\nabla_x \nabla_y \Psi)\, \mathrm{Cov}\{y\}\, (\nabla_x \nabla_y \Psi)' \left[ \nabla^2_x \Psi \right]^{-1}$.

157 Covariance Continued
Covariance approximation:
$\mathrm{Cov}\{\hat{x}\} \approx \left[ \nabla^2_x \Psi(x^{\text{true}}, \bar{y}) \right]^{-1} \mathrm{Cov}\{\nabla_x \Psi(x^{\text{true}}, y)\} \left[ \nabla^2_x \Psi(x^{\text{true}}, \bar{y}) \right]^{-1}$
Depends only on the chosen cost function and statistical model. Independent of optimization algorithm. Enables prediction of noise properties: can make variance images; useful for computing ROI variance (e.g., for weighted kinetic fitting). Good variance prediction for quadratic regularization in nonzero regions; inaccurate for nonquadratic penalties or in nearly-zero regions. 4.8
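As one concrete instance of the sandwich formula above (an assumption for illustration, not the general Poisson case of the slides): for a PWLS cost Ψ(x, y) = ½(y − Ax)'W(y − Ax) + (β/2) x'Rx, the outer factors equal A'WA + βR and the middle factor is A'W Cov{y} WA. A small-problem NumPy sketch:

```python
import numpy as np

def pwls_covariance(A, W, R, beta, cov_y):
    """Sandwich covariance approximation for a PWLS estimator.

    Cov{xhat} ~= H^{-1} (A'W Cov{y} W A) H^{-1},  with H = A'WA + beta*R.
    Returns the full covariance matrix; its diagonal is a variance image.
    Only practical for small problems with explicit matrices.
    """
    H = A.T @ W @ A + beta * R
    mid = A.T @ W @ cov_y @ W @ A
    Hinv = np.linalg.inv(H)          # fine for small n_p; use solves for larger problems
    return Hinv @ mid @ Hinv

# variance image: np.diag(pwls_covariance(A, W, R, beta, cov_y))
```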

158 Qi and Huesman's Detection Analysis SNR of MAP reconstruction > SNR of FBP reconstruction (T-MI, Aug. 2001), assuming quadratic regularization, an SKE/BKE task, and either a prewhitened or non-prewhitened observer. Open issues: choice of regularizer to optimize detectability? Active work in several groups (e.g., 2004 MIC poster by Yendiki & Fessler). 4.9

159 Part 5. Miscellaneous Topics (Pet peeves and more-or-less recent favorites) Short transmission scans 3D PET options OSEM of transmission data (ugh!) Precorrected PET data Transmission scan problems List-mode EM List of other topics I wish I had time to cover

160 PET Attenuation Correction (J. Nuyts) 5.2

161 Iterative reconstruction for 3D PET Fully 3D iterative reconstruction Rebinning / 2.5D iterative reconstruction Rebinning / 2D iterative reconstruction PWLS OSEM with attenuation weighting 3D FBP Rebinning / FBP 5.3

162 OSEM of Transmission Data? Bai and Kinahan et al., "Post-injection single photon transmission tomography with ordered-subset algorithms for whole-body PET imaging": a 3D penalty is better than a 2D penalty; OSTR with a 3D penalty is better than FBP and OSEM; using the standard deviation from a single realization to estimate noise can be misleading. Using OSEM for transmission data requires taking a logarithm, whereas OSTR does not. 5.4

163 Precorrected PET data C. Michel examined the shifted-Poisson model and weighted OSEM of various flavors, and concluded that attenuation weighting matters especially. 5.5

164 Transmission Scan Challenges Overlapping-beam transmission scans Polyenergetic X-ray CT scans Sourceless attenuation correction All can be tackled with optimization transfer methods. 5.6

165 List-mode EM
$x_j^{(n+1)} = \frac{x_j^{(n)}}{\sum_{i=1}^{n_d} a_{ij}} \left[ \sum_{i=1}^{n_d} a_{ij}\, \frac{y_i}{\bar{y}_i^{(n)}} \right] = \frac{x_j^{(n)}}{\sum_{i=1}^{n_d} a_{ij}} \sum_{i:\, y_i \neq 0} a_{ij}\, \frac{y_i}{\bar{y}_i^{(n)}}$
Useful when $\sum_{i=1}^{n_d} y_i \ll \sum_{i=1}^{n_d} 1 = n_d$.
Attenuation and scatter non-trivial. Computing $a_{ij}$ on-the-fly. Computing the sensitivity $\sum_{i=1}^{n_d} a_{ij}$ is still painful. List-mode ordered-subsets is naturally balanced. 5.7
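A minimal sketch of the list-mode EM update above for an idealized Poisson model without attenuation, scatter, or randoms; the event list is assumed to store the LOR index of each detected event, and a_row(i) is assumed to return that system-matrix row on the fly. Both interfaces are illustrative, not a particular scanner's list-mode format.

```python
import numpy as np

def list_mode_em(event_lors, a_row, sens, x0, n_iters):
    """List-mode EM: loop over detected events only (rays with y_i = 0 are skipped).

    event_lors : sequence of LOR indices, one entry per detected event
                 (an LOR recorded k times plays the role of y_i = k)
    a_row(i)   : (n_p,) row a_i. of the system matrix, computed on the fly
    sens       : (n_p,) sensitivity image s_j = sum_i a_ij (precomputed; still painful)
    Assumes every pixel has nonzero sensitivity.
    """
    x = x0.copy()
    for _ in range(n_iters):
        backproj = np.zeros_like(x)
        for i in event_lors:                 # only nonzero-count rays contribute
            a_i = a_row(i)
            ybar_i = a_i @ x                 # predicted mean for this LOR
            backproj += a_i / ybar_i         # accumulate a_ij / ybar_i per event
        x = x * backproj / sens              # multiplicative EM update
    return x
```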

166 Misc 4D regularization (reconstruction of dynamic image sequences) Sourceless attenuation-map estimation Post-injection transmission/emission reconstruction µ-value priors for transmission reconstruction Local errors in ˆµ propagate into emission image (PET and SPECT) 5.8

167 Summary Predictability of resolution / noise and the ability to control spatial resolution argue for a regularized cost function. Still work to be done...
