Wavelet Methods for Time Series Analysis. Part III: Wavelet-Based Signal Extraction and Denoising


Wavelet Methods for Time Series Analysis, Part III: Wavelet-Based Signal Extraction and Denoising. Overview of key ideas behind the wavelet-based approach; description of four basic models for signal estimation; discussion of why wavelets can help estimate certain signals; simple thresholding & shrinkage schemes for signal estimation; wavelet-based thresholding and shrinkage; discussion of some extensions to the basic approach. III 1

Wavelet-Based Signal Estimation: I. DWT analysis of X yields the coefficients W = 𝒲X; DWT synthesis X = 𝒲^T W yields a multiresolution analysis by splitting 𝒲^T W into pieces associated with different scales; DWT synthesis can also estimate a signal hidden in X if we can modify W to get rid of noise in the wavelet domain: applying 𝒲^T to a noise-reduced version of W yields a signal estimate. WMTSA: 393 III 2

Wavelet-Based Signal Estimation: II. Key ideas behind simple wavelet-based signal estimation: certain signals can be efficiently described by the DWT using all of the scaling coefficients plus a small number of large wavelet coefficients; noise is manifested in a large number of small wavelet coefficients; we can either threshold or shrink the wavelet coefficients to eliminate noise in the wavelet domain; these key ideas led to the wavelet thresholding and shrinkage methods proposed by Donoho, Johnstone and coworkers in the 1990s. WMTSA: 393–394 III 3

Models for Signal Estimation: I. Will consider two types of signals: 1. D, an N-dimensional deterministic signal; 2. C, an N-dimensional stochastic signal, i.e., a vector of random variables (RVs) with covariance matrix Σ_C. Will consider two types of noise: 1. ε, an N-dimensional vector of independent and identically distributed (IID) RVs with mean 0 and covariance matrix Σ_ε = σ²_ε I_N; 2. η, an N-dimensional vector of non-IID RVs with mean 0 and covariance matrix Σ_η (one form: RVs independent, but with different variances; another form of non-IID: RVs are correlated). WMTSA: 393–394 III 4

Models for Signal Estimation: II. This leads to four basic 'signal + noise' models for X: 1. X = D + ε; 2. X = D + η; 3. X = C + ε; 4. X = C + η. In the latter two cases, the stochastic signal C is assumed to be independent of the associated noise. WMTSA: 393–394 III 5

Signal Representation via Wavelets: I. Consider deterministic signals D first. The signal estimation problem is simplified if we can assume that the important part of D is in its large values; this assumption is not usually viable in the original (i.e., time-domain) representation D, but might be true in another domain. An orthonormal transform 𝒪 might be useful because O = 𝒪D is equivalent to D (since D = 𝒪^T O), and we might be able to find an 𝒪 such that the signal is isolated in M ≪ N large transform coefficients. Q: how can we judge whether a particular 𝒪 might be useful for representing D? WMTSA: 394 III 6

Signal Representation via Wavelets: II. Let O_j be the jth transform coefficient in O = 𝒪D, and let O_(0), O_(1), ..., O_(N−1) be the O_j's reordered by magnitude: |O_(0)| ≥ |O_(1)| ≥ ... ≥ |O_(N−1)|. Example: if O = [3, 1, 4, 7, 2, 1]^T, then O_(0) = O_3 = 7, O_(1) = O_2 = 4, O_(2) = O_0 = 3, etc. Define a normalized partial energy sequence (NPES): C_{M−1} ≡ (Σ_{j=0}^{M−1} |O_(j)|²) / (Σ_{j=0}^{N−1} |O_(j)|²) = (energy in the largest M terms) / (total energy in the signal). Let I_M be the N × N diagonal matrix whose jth diagonal term is 1 if O_j is one of the M largest in magnitude and is 0 otherwise. WMTSA: 394–395 III 7

Signal Representation via Wavelets: III. Form D̂_M ≡ 𝒪^T I_M O, which is an approximation to D. When O = [3, 1, 4, 7, 2, 1]^T and M = 3, we have I_3 = diag(1, 0, 1, 1, 0, 0) and thus D̂_M = 𝒪^T [3, 0, 4, 7, 0, 0]^T. One interpretation for the NPES: C_{M−1} = 1 − ‖D − D̂_M‖²/‖D‖² = 1 − relative approximation error. WMTSA: 394–395 III 8
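As a concrete illustration of the NPES and of the approximation D̂_M = 𝒪^T I_M O, here is a minimal Python/numpy sketch (not taken from WMTSA) that uses the orthonormal DFT as the transform 𝒪; the test signal and the choice M = 10 are arbitrary placeholders.

import numpy as np

# NPES and best-M-term approximation under the orthonormal DFT (illustrative sketch)
N = 128
t = np.arange(N)
D = np.sin(2 * np.pi * 4 * t / N)        # a sinusoid ...
D[60:64] += 2.0                          # ... plus a small "bump"

O = np.fft.fft(D) / np.sqrt(N)           # orthonormal DFT coefficients O = F D

energy = np.sort(np.abs(O) ** 2)[::-1]   # |O_(0)|^2 >= |O_(1)|^2 >= ... >= |O_(N-1)|^2
C = np.cumsum(energy) / energy.sum()     # NPES: C_0, C_1, ..., C_{N-1}

M = 10                                   # keep the M largest-magnitude coefficients (I_M)
keep = np.argsort(np.abs(O))[::-1][:M]
O_M = np.zeros_like(O)
O_M[keep] = O[keep]
D_hat = np.sqrt(N) * np.fft.ifft(O_M)    # time-domain approximation D_hat_M (take .real;
                                         # exactly real only if the kept set is conjugate-symmetric)

# by orthonormality, 1 - C_{M-1} equals the relative approximation error
print(C[M - 1], 1.0 - np.sum(np.abs(O - O_M) ** 2) / np.sum(np.abs(O) ** 2))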

Signal Representation via Wavelets: IV. [figure: the three signals D_1, D_2 and D_3, each plotted for t = 0, ..., 127] Consider the three signals plotted above: D_1 is a sinusoid, which can be represented succinctly by the discrete Fourier transform (DFT); D_2 is a 'bump' (only a few nonzero values in the time domain); D_3 is a linear combination of D_1 and D_2. WMTSA: 395–396 III 9

Signal Representation via Wavelets: V. Consider three different orthonormal transforms: the identity transform I (time); the orthonormal DFT F (frequency), where F has (k, t)th element exp(−i2πtk/N)/√N for 0 ≤ k, t ≤ N−1; and the LA(8) DWT 𝒲 (wavelet). Number of terms M needed to achieve relative error < 1%:
                 D_1   D_2   D_3
  DFT              2    29    28
  identity        15     9    75
  LA(8) wavelet   22    14    21
WMTSA: 395–396 III 10
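The kind of comparison summarized in the table can be sketched in a few lines of numpy; the following illustration (not the computation behind the table, and with an arbitrary test signal) counts how many coefficients are needed to drive the relative error below 1% for a pure sinusoid under the identity transform and under the orthonormal DFT.

import numpy as np

def terms_needed(O, tol=0.01):
    # smallest M such that the relative approximation error 1 - C_{M-1} falls below tol
    energy = np.sort(np.abs(O) ** 2)[::-1]
    C = np.cumsum(energy) / energy.sum()
    return int(np.argmax(1.0 - C < tol)) + 1

N = 128
t = np.arange(N)
D1 = np.sin(2 * np.pi * 4 * t / N)                    # a D_1-like sinusoid
print(terms_needed(D1))                               # identity transform: most samples needed
print(terms_needed(np.fft.fft(D1) / np.sqrt(N)))      # orthonormal DFT: only 2 coefficients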

Signal Representation via Wavelets: VI. [figure: NPESs for D_1, D_2 and D_3, each plotted versus M] Use NPESs to see how well these three signals are represented in the time, frequency (DFT) and wavelet (LA(8)) domains: time (solid curves), frequency (dotted) and wavelet (dashed). WMTSA: 395–396 III 11

Signal Representation via Wavelets: VII. Example: a vertical ocean shear time series, which has frequency-domain fluctuations and also time-domain turbulent activity. The next overhead shows the signal D itself; its approximation D̂_100 from 100 LA(8) DWT coefficients; D̂_300 from 300 LA(8) DWT coefficients, giving C_299 ≈ 0.9983; and D̂_300 from 300 DFT coefficients, giving C_299 ≈ 0.9973. Note that 300 coefficients is less than 5% of N = 6784! WMTSA: 396–397 III 12

Signal Representation via Wavelets: VIII. [figure: the ocean shear signal D and its approximations D̂_100 (LA(8) DWT), D̂_300 (LA(8) DWT) and D̂_300 (DFT), plotted versus depth in meters] Need 123 additional orthonormal DFT (ODFT) coefficients to match C_299 for the DWT. WMTSA: 396–397 III 13

Signal Representation via Wavelets: IX. 2nd example: DFT-based D̂_M (left-hand column of the figure) & level J_0 = 6 LA(8) DWT-based D̂_M (right-hand column) for the NMR series X (data courtesy of A. Maudsley, UCSF). WMTSA: 431–432 III 14

Signal Estimation via Thresholding: I. Assume the model of a deterministic signal plus IID noise: X = D + ε. Let 𝒪 be an N × N orthonormal matrix and form O = 𝒪X = 𝒪D + 𝒪ε ≡ d + e; component-wise, we have O_l = d_l + e_l. Define the signal-to-noise ratio (SNR): ‖D‖²/E{‖ε‖²} = ‖d‖²/E{‖e‖²} = (Σ_{l=0}^{N−1} d_l²) / (Σ_{l=0}^{N−1} E{e_l²}). Assume that the SNR is large and that d has just a few large coefficients; i.e., large signal coefficients dominate O. WMTSA: 398 III 15

Signal Estimation via Thresholding: II. Recall the simple estimator D̂_M ≡ 𝒪^T I_M O and the previous example: with M = 3, D̂_M = 𝒪^T I_3 [O_0, O_1, O_2, O_3, O_4, O_5]^T = 𝒪^T [O_0, 0, O_2, O_3, 0, 0]^T. Let J_m be the set of m indices corresponding to the places where the jth diagonal element of I_m is 1; in the example above, we have J_3 = {0, 2, 3}. The strategy in forming D̂_M is to keep a coefficient O_j if j ∈ J_m but to replace it with 0 if j ∉ J_m (a 'kill or keep' strategy). WMTSA: 398 III 16

Signal Estimation via Thresholding: III. Can pose a simple optimization problem whose solution 1. is a 'kill or keep' strategy (and hence justifies this strategy), 2. dictates that we use the coefficients with the largest magnitudes, and 3. tells us what M should be (once we set a certain parameter). Optimization problem: find D̂_m such that γ_m ≡ ‖X − D̂_m‖² + mδ² is minimized over all possible I_m, m = 0, ..., N; in the above, δ² is a fixed parameter (set a priori). WMTSA: 398 III 17

Signal Estimation via Thresholding: IV. ‖X − D̂_m‖² is a measure of fidelity; rationale for this term: under our assumption of a high SNR, D̂_m shouldn't stray too far from X; fidelity increases (the measure decreases) as m increases, so in minimizing γ_m, consideration of this term alone suggests that m should be large. mδ² is a penalty for too many terms; rationale: the heuristic says d has only a few large coefficients; the penalty increases as m increases, so in minimizing γ_m, consideration of this term alone suggests that m should be small. The optimization problem thus balances off fidelity & parsimony. WMTSA: 398 III 18

Signal Estimation via Thresholding: V. Claim: γ_m = ‖X − D̂_m‖² + mδ² is minimized when m is set to the number of coefficients O_j such that O_j² > δ². Proof of claim: since X = 𝒪^T O & D̂_m ≡ 𝒪^T I_m O, we have γ_m = ‖X − D̂_m‖² + mδ² = ‖𝒪^T O − 𝒪^T I_m O‖² + mδ² = ‖𝒪^T (I_N − I_m) O‖² + mδ² = ‖(I_N − I_m) O‖² + mδ² = Σ_{j∉J_m} O_j² + Σ_{j∈J_m} δ². For any given j, if j ∉ J_m, we contribute O_j² to the first sum; on the other hand, if j ∈ J_m, we contribute δ² to the second sum; to minimize γ_m, we thus need to put j in J_m exactly when O_j² > δ², which establishes the claim. WMTSA: 398 III 19
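The claim can also be checked by brute force on a toy example; the sketch below (with arbitrary simulated coefficients and an arbitrary δ) enumerates every possible index set J_m and confirms that the minimizer of γ_m keeps exactly the coefficients with O_j² > δ².

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, delta = 8, 0.8                                # toy size so that full enumeration is feasible
O = rng.normal(size=N)

def gamma(J):
    # J is the set of kept indices; gamma_m = sum of dropped O_j^2 plus m * delta^2
    dropped = [j for j in range(N) if j not in J]
    return np.sum(O[dropped] ** 2) + len(J) * delta ** 2

all_sets = [set(J) for m in range(N + 1) for J in combinations(range(N), m)]
best = min(all_sets, key=gamma)
print(best == {j for j in range(N) if O[j] ** 2 > delta ** 2})   # expect True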

Thresholding Functions: I. More generally, thresholding schemes involve 1. computing O ≡ 𝒪X; 2. defining O^(t) as the vector whose lth element is O^(t)_l = 0 if |O_l| ≤ δ, and some nonzero value otherwise (the nonzero values are yet to be defined); 3. estimating D via D̂^(t) ≡ 𝒪^T O^(t). The simplest scheme is hard thresholding (the 'kill/keep' strategy): O^(ht)_l = 0 if |O_l| ≤ δ, and O^(ht)_l = O_l otherwise. WMTSA: 399 III 20

Thresholding Functions: II. [figure: the mapping from O_l to O^(ht)_l, plotted for O_l between −3δ and 3δ] WMTSA: 399 III 21

Thresholding Functions: III. An alternative scheme is soft thresholding: O^(st)_l = sign{O_l} (|O_l| − δ)_+, where sign{O_l} = +1 if O_l > 0, 0 if O_l = 0, and −1 if O_l < 0, and where (x)_+ ≡ x if x ≥ 0, and 0 if x < 0. One rationale for soft thresholding is that it fits into Stein's class of estimators (will discuss later on). WMTSA: 399–400 III 22

Thresholding Functions: IV. [figure: the mapping from O_l to O^(st)_l, plotted for O_l between −3δ and 3δ] WMTSA: 399–400 III 23

Thresholding Functions: V. A third scheme is mid thresholding: O^(mt)_l = sign{O_l} (|O_l| − δ)_{++}, where (|O_l| − δ)_{++} ≡ 2(|O_l| − δ)_+ if |O_l| < 2δ, and |O_l| otherwise; this provides a compromise between hard and soft thresholding. WMTSA: 399–400 III 24
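The three rules are easy to state in code; here is a minimal numpy sketch of hard, soft and mid thresholding as defined above (the example coefficients and δ = 1 are placeholders).

import numpy as np

def hard_threshold(O, delta):
    # kill/keep: zero out coefficients with |O_l| <= delta, leave the rest alone
    return np.where(np.abs(O) <= delta, 0.0, O)

def soft_threshold(O, delta):
    # sign{O_l} * (|O_l| - delta)_+
    return np.sign(O) * np.maximum(np.abs(O) - delta, 0.0)

def mid_threshold(O, delta):
    # sign{O_l} * (|O_l| - delta)_++ : twice the soft value below 2*delta, |O_l| beyond
    plus = np.maximum(np.abs(O) - delta, 0.0)
    return np.sign(O) * np.where(np.abs(O) < 2 * delta, 2 * plus, np.abs(O))

O = np.array([-3.0, -0.5, 0.8, 1.5, 2.5])
for rule in (hard_threshold, soft_threshold, mid_threshold):
    print(rule.__name__, rule(O, 1.0))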

Thresholding Functions: VI. [figure: the mapping from O_l to O^(mt)_l, plotted for O_l between −3δ and 3δ] WMTSA: 399–400 III 25

Thresholding Functions: VII. [figure: example of mid thresholding with δ = 1, showing the series X_t, the coefficients O_l, the thresholded coefficients O^(mt)_l and the resulting estimate D̂^(mt)_t] III 26

Universal Threshold: I. Q: how do we go about setting δ? Specialize to IID Gaussian noise with covariance σ² I_N; can argue that e ≡ 𝒪ε is then also IID Gaussian with covariance σ² I_N. Donoho & Johnstone (1995) proposed the universal threshold δ^(u) ≡ [2σ² log(N)]^{1/2} ('log' here is log base e). Rationale for δ^(u): because of Gaussianity, can argue that P[max_l |e_l| > δ^(u)] ≤ 1/[4π log(N)]^{1/2} → 0 as N → ∞, i.e., P[max_l |e_l| ≤ δ^(u)] → 1 as N → ∞, so no noise coefficient will exceed the threshold in the limit. WMTSA: 400–402 III 27
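A small simulation helps make this rationale concrete; the sketch below (the sample sizes and the 500 replications are arbitrary choices) draws IID Gaussian noise series and estimates how often max_l |e_l| exceeds δ^(u), an exceedance probability that shrinks slowly as N grows.

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
for N in (256, 2048, 16384):
    delta_u = np.sqrt(2 * sigma ** 2 * np.log(N))            # universal threshold
    e = rng.normal(scale=sigma, size=(500, N))                # 500 pure-noise series of length N
    exceed = np.mean(np.max(np.abs(e), axis=1) > delta_u)     # fraction exceeding delta_u
    print(N, round(float(delta_u), 2), exceed)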

Universal Threshold: II. Suppose D is a vector of zeros so that O_l = e_l; the above implies that O^(ht) = 0 with high probability as N → ∞, and hence we will estimate the correct D with high probability. Critique of δ^(u): consider lots of IID Gaussian series with N = 128: only 13% will have any values exceeding δ^(u); δ^(u) is thus slanted toward eliminating the vast majority of the noise, but, if we use, e.g., hard thresholding, any nonzero signal transform coefficient of a fixed magnitude will eventually get set to 0 as N → ∞. Nonetheless, δ^(u) works remarkably well. WMTSA: 400–402 III 28

Minimum Unbiased Risk: I. A second approach for setting δ is data-adaptive, but only works for selected thresholding functions. Assume the model of a deterministic signal plus non-IID noise: X = D + η, so that O ≡ 𝒪X = 𝒪D + 𝒪η ≡ d + n; component-wise, we have O_l = d_l + n_l. Further assume that n_l is an N(0, σ²_{n_l}) RV, where σ²_{n_l} is assumed to be known, but we allow the possibility that the n_l's are correlated. Let O^(δ)_l be an estimator of d_l based on a (yet to be determined) threshold δ; we want to make E{(O^(δ)_l − d_l)²} as small as possible. WMTSA: 402–403 III 29

Minimum Unbiased Risk: II. Stein (1981) considered estimators restricted to be of the form O^(δ)_l = O_l + A^(δ)(O_l), where A^(δ)(·) must be weakly differentiable (basically, piecewise continuous plus a bit more). Since O_l = d_l + n_l, the above yields O^(δ)_l − d_l = n_l + A^(δ)(O_l), so E{(O^(δ)_l − d_l)²} = σ²_{n_l} + 2E{n_l A^(δ)(O_l)} + E{[A^(δ)(O_l)]²}. Because of Gaussianity, can reduce the middle term: E{n_l A^(δ)(O_l)} = σ²_{n_l} E{ (d/dx) A^(δ)(x) |_{x=O_l} }. WMTSA: 402–404 III 30

Minimum Unbiased Risk: III. Practical scheme: given realizations o_l of O_l, find the δ minimizing an estimate of Σ_{l=0}^{N−1} E{(O^(δ)_l − d_l)²}, which, in view of E{(O^(δ)_l − d_l)²} = σ²_{n_l} + 2σ²_{n_l} E{ (d/dx) A^(δ)(x) |_{x=O_l} } + E{[A^(δ)(O_l)]²}, is Σ_{l=0}^{N−1} R(σ_{n_l}, o_l, δ) ≡ Σ_{l=0}^{N−1} { σ²_{n_l} + 2σ²_{n_l} (d/dx) A^(δ)(x)|_{x=o_l} + [A^(δ)(o_l)]² }. For a given δ, the above is Stein's unbiased risk estimator (SURE). WMTSA: 404 III 31

Minimum Unbiased Risk: IV. Example: if we set A^(δ)(O_l) = −O_l when |O_l| < δ and A^(δ)(O_l) = −δ sign{O_l} when |O_l| ≥ δ, we obtain O^(δ)_l = O_l + A^(δ)(O_l) = O^(st)_l, i.e., soft thresholding. For this case, can argue that R(σ_{n_l}, O_l, δ) = O_l² − σ²_{n_l} + (2σ²_{n_l} − O_l² + δ²) 1_{[δ²,∞)}(O_l²), where 1_{[δ²,∞)}(x) ≡ 1 if δ² ≤ x < ∞, and 0 otherwise. Only the last term depends on δ, and, as a function of δ, SURE is minimized when this last term is minimized. WMTSA: 404–406 III 32

Minimum Unbiased Risk: V. The data-adaptive scheme is to replace O_l with its realization, say o_l, and to set δ equal to the value, say δ^(S), minimizing Σ_{l=0}^{N−1} (2σ²_{n_l} − o_l² + δ²) 1_{[δ²,∞)}(o_l²); must have δ^(S) = |o_l| for some l, so the minimization is easy. If the n_l have a common variance, i.e., σ²_{n_l} = σ² for all l, we need to find the minimizer of the following function of δ: Σ_{l=0}^{N−1} (2σ² − o_l² + δ²) 1_{[δ²,∞)}(o_l²) (in practice, σ² is usually unknown, so later on we will consider how to estimate it as well). WMTSA: 404–406 III 33
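In the common-variance case the SURE search reduces to evaluating the sum above at the candidate thresholds δ = |o_l|; the following sketch (with simulated sparse coefficients and σ = 1 as placeholders) picks δ^(S) that way.

import numpy as np

def sure_threshold(o, sigma):
    # minimize sum_l (2*sigma^2 - o_l^2 + delta^2) * 1{o_l^2 >= delta^2} over delta in {|o_l|}
    cands = np.abs(o)
    risks = [np.sum((2 * sigma ** 2 - o ** 2 + thr ** 2)[o ** 2 >= thr ** 2]) for thr in cands]
    return cands[int(np.argmin(risks))]

rng = np.random.default_rng(2)
d = np.zeros(256)
d[:8] = 5.0                                   # a few large "signal" coefficients
o = d + rng.normal(size=256)                  # observed coefficients, noise sigma = 1
print(sure_threshold(o, 1.0))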

Signal Estimation via Shrinkage. So far, we have only considered signal estimation via thresholding rules, which map some O_l to zero. We will now consider shrinkage rules, which differ from thresholding only in that nonzero coefficients are mapped to nonzero values rather than exactly zero (but the values can be very close to zero!). There are three approaches that lead to shrinkage rules: 1. linear mean square estimation, 2. conditional mean and median, and 3. a Bayesian approach; we will only consider 1 and 2, but one form of the Bayesian approach turns out to be identical to 2. III 34

Linear Mean Square Estimation: I. Assume the model of a stochastic signal plus non-IID noise: X = C + η, so that O = 𝒪X = 𝒪C + 𝒪η ≡ R + n; component-wise, we have O_l = R_l + n_l. Assume C and η are multivariate Gaussian with covariance matrices Σ_C and Σ_η; this implies that R and n are also Gaussian, but now with covariance matrices 𝒪Σ_C𝒪^T and 𝒪Σ_η𝒪^T. Assume that E{R_l} = 0 for any component of interest and that R_l & n_l are uncorrelated. Suppose we estimate R_l via a simple scaling of O_l: R̂_l ≡ a_l O_l, where a_l is a constant to be determined. WMTSA: 407 III 35

Linear Mean Square Estimation: II. Let us select a_l by making E{(R_l − R̂_l)²} as small as possible, which occurs when we set a_l = E{R_l O_l}/E{O_l²}. Because R_l and n_l are uncorrelated with means 0 and because O_l = R_l + n_l, we have E{R_l O_l} = E{R_l²} and E{O_l²} = E{R_l²} + E{n_l²}, yielding R̂_l = (E{R_l²}/[E{R_l²} + E{n_l²}]) O_l = (σ²_{R_l}/[σ²_{R_l} + σ²_{n_l}]) O_l. Note: the optimum a_l shrinks O_l toward zero, with the shrinkage increasing as the noise variance increases. WMTSA: 407–408 III 36
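A quick Monte Carlo sketch (all parameter values are arbitrary) confirms that the factor a_l = σ²_{R_l}/(σ²_{R_l} + σ²_{n_l}) beats other constant scalings in mean square error.

import numpy as np

rng = np.random.default_rng(3)
sigma_R, sigma_n = 2.0, 1.0
R = rng.normal(scale=sigma_R, size=100000)        # "signal" coefficients
O = R + rng.normal(scale=sigma_n, size=100000)    # observed coefficients

a_opt = sigma_R ** 2 / (sigma_R ** 2 + sigma_n ** 2)   # equals 0.8 here
for a in (0.5, a_opt, 1.0):
    print(a, np.mean((R - a * O) ** 2))                # MSE is smallest at a_opt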

Background on Conditional PDFs: I. Let X and Y be RVs with probability density functions (PDFs) f_X(·) and f_Y(·), and let f_{X,Y}(x, y) be their joint PDF at the point (x, y). f_X(·) and f_Y(·) are called marginal PDFs and can be obtained from the joint PDF via integration: f_X(x) = ∫ f_{X,Y}(x, y) dy. The conditional PDF of Y given X = x is defined as f_{Y|X=x}(y) = f_{X,Y}(x, y)/f_X(x) ('|' is read as 'given' or 'conditional on'). WMTSA: 258–260 III 37

Background on Conditional PDFs: II. By definition, RVs X and Y are said to be independent if f_{X,Y}(x, y) = f_X(x) f_Y(y), in which case f_{Y|X=x}(y) = f_{X,Y}(x, y)/f_X(x) = f_X(x) f_Y(y)/f_X(x) = f_Y(y); thus X and Y are independent if knowing X doesn't allow us to alter our probabilistic description of Y. f_{Y|X=x}(·) is a PDF, so its mean value is E{Y | X = x} = ∫ y f_{Y|X=x}(y) dy; the above is called the conditional mean of Y, given X. WMTSA: 260 III 38

Background on Conditional PDFs: III. Suppose RVs X and Y are related, but we can only observe X, and we want to approximate the unobservable Y based on some function of the observable X. Example: we observe part of a time series containing a signal buried in noise, and we want to approximate the unobservable signal component based upon a function of what we observed. Suppose we want our approximation to be the function of X, say U_2(X), such that the mean square difference between Y and U_2(X) is as small as possible; i.e., we want E{(Y − U_2(X))²} to be as small as possible. WMTSA: 260–261 III 39

Background on Conditional PDFs: IV. The solution is to use U_2(X) = E{Y | X}; i.e., the conditional mean of Y given X is our best guess at Y in the sense of minimizing the mean square error (related to the fact that E{(Y − a)²} is smallest when a = E{Y}). On the other hand, suppose we want the function U_1(X) such that the mean absolute error E{|Y − U_1(X)|} is as small as possible; the solution now is to let U_1(X) be the conditional median, i.e., we must solve ∫_{−∞}^{U_1(x)} f_{Y|X=x}(y) dy = 0.5 to figure out what U_1(x) should be when X = x. WMTSA: 260–261 III 40

Conditional Mean and Median Approach: I. Assume the model of a stochastic signal plus non-IID noise: X = C + η, so that O = 𝒪X = 𝒪C + 𝒪η ≡ R + n; component-wise, we have O_l = R_l + n_l; because C and η are independent, R and n must be also. Suppose we approximate R_l via R̂_l ≡ U_2(O_l), where U_2(O_l) is selected to minimize E{(R_l − U_2(O_l))²}; the solution is to set U_2(O_l) equal to E{R_l | O_l}, so let's work out what form this conditional mean takes. To get E{R_l | O_l}, we need the PDF of R_l given O_l, which is f_{R_l|O_l=o_l}(r_l) = f_{R_l,O_l}(r_l, o_l)/f_{O_l}(o_l). WMTSA: 408–409 III 41

Conditional Mean and Median Approach: II. The joint PDF of R_l and O_l is related to the joint PDF f_{R_l,n_l}(·, ·) of R_l and n_l via f_{R_l,O_l}(r_l, o_l) = f_{R_l,n_l}(r_l, o_l − r_l) = f_{R_l}(r_l) f_{n_l}(o_l − r_l), with the 2nd equality following since R_l & n_l are independent. The marginal PDF for O_l can be obtained from the joint PDF f_{R_l,O_l}(·, ·) by integrating out the first argument: f_{O_l}(o_l) = ∫ f_{R_l,O_l}(r_l, o_l) dr_l = ∫ f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l. Putting all these pieces together yields the conditional PDF f_{R_l|O_l=o_l}(r_l) = f_{R_l,O_l}(r_l, o_l)/f_{O_l}(o_l) = f_{R_l}(r_l) f_{n_l}(o_l − r_l) / ∫ f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l. WMTSA: 409–410 III 42

Conditional Mean and Median Approach: III. The mean value of f_{R_l|O_l=o_l}(·) yields the estimator R̂_l = E{R_l | O_l}: E{R_l | O_l = o_l} = ∫ r_l f_{R_l|O_l=o_l}(r_l) dr_l = ∫ r_l f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l / ∫ f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l. To make further progress, we need a model for the wavelet-domain representation R_l of the signal; the heuristic that a signal in the wavelet domain has a few large values and lots of small values suggests a Gaussian mixture model. WMTSA: 410 III 43

Conditional Mean and Median Approach: IV. Let I_l be an RV such that P[I_l = 1] = p_l & P[I_l = 0] = 1 − p_l. Under the Gaussian mixture model, R_l has the same distribution as I_l N(0, γ_l² σ²_{G_l}) + (1 − I_l) N(0, σ²_{G_l}), where N(0, σ²) denotes a Gaussian RV with mean 0 and variance σ²; the 2nd component models the small number of large signal coefficients, while the 1st component models the large number of small coefficients (γ_l² ≪ 1). [figure: the two component PDFs for the case σ²_{G_l} = 100, γ_l² σ²_{G_l} = 1 and p_l = 0.75] WMTSA: 410 III 44

Conditional Mean and Median Approach: V. To complete the model, let n_l obey a Gaussian distribution with mean 0 and variance σ²_{n_l}. The conditional mean estimator of the signal RV R_l is then given by E{R_l | O_l = o_l} = ([a_l A_l(o_l) + b_l B_l(o_l)] / [A_l(o_l) + B_l(o_l)]) o_l, where a_l ≡ γ_l² σ²_{G_l}/(γ_l² σ²_{G_l} + σ²_{n_l}), b_l ≡ σ²_{G_l}/(σ²_{G_l} + σ²_{n_l}), A_l(o_l) ≡ (p_l/(2π[γ_l² σ²_{G_l} + σ²_{n_l}])^{1/2}) exp(−o_l²/[2(γ_l² σ²_{G_l} + σ²_{n_l})]) and B_l(o_l) ≡ ((1 − p_l)/(2π[σ²_{G_l} + σ²_{n_l}])^{1/2}) exp(−o_l²/[2(σ²_{G_l} + σ²_{n_l})]). WMTSA: 410–411 III 45

Conditional Mean and Median Approach: VI. Let's simplify to a sparse signal model by setting γ_l = 0; i.e., the large number of small coefficients are all exactly zero, and the distribution of R_l is the same as that of (1 − I_l) N(0, σ²_{G_l}). The conditional mean estimator becomes E{R_l | O_l = o_l} = b_l o_l/(1 + c_l), where c_l = (p_l (σ²_{G_l} + σ²_{n_l})^{1/2} / [(1 − p_l) σ_{n_l}]) exp(−o_l² b_l/(2σ²_{n_l})). WMTSA: 411 III 46
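Here is a direct transcription of that sparse-model shrinkage rule into Python (the parameter values in the example call are placeholders): small |o_l| are pulled essentially to zero, while large |o_l| are scaled by roughly b_l.

import numpy as np

def cond_mean_shrink(o, p, sigma_G2, sigma_n2):
    # E{R_l | O_l = o} = b * o / (1 + c) under the sparse signal model (gamma_l = 0)
    b = sigma_G2 / (sigma_G2 + sigma_n2)
    c = (p * np.sqrt(sigma_G2 + sigma_n2) / ((1.0 - p) * np.sqrt(sigma_n2))) \
        * np.exp(-o ** 2 * b / (2.0 * sigma_n2))
    return b * o / (1.0 + c)

o = np.linspace(-6.0, 6.0, 7)
print(cond_mean_shrink(o, p=0.95, sigma_G2=25.0, sigma_n2=1.0))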

Conditional Mean and Median Approach: VII. The figure shows the conditional mean shrinkage rule E{R_l | O_l = o_l} versus o_l for p_l = 0.95 (i.e., 95% of the signal coefficients are 0), σ²_{n_l} = 1, and σ²_{G_l} = 5 (curve furthest from the dotted diagonal), 10 and 25 (curve nearest to the diagonal). As σ²_{G_l} gets large (i.e., the large signal coefficients increase in size), the shrinkage rule starts to resemble a mid thresholding rule. WMTSA: 411–412 III 47

Conditional Mean and Median Approach: VIII. Now suppose we estimate R_l via R̂_l = U_1(O_l), where U_1(O_l) is selected to minimize E{|R_l − U_1(O_l)|}; the solution is to set U_1(o_l) to the median of the PDF for R_l given O_l = o_l. To find U_1(o_l), we need to solve for it in the equation ∫_{−∞}^{U_1(o_l)} f_{R_l|O_l=o_l}(r_l) dr_l = ∫_{−∞}^{U_1(o_l)} f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l / ∫_{−∞}^{∞} f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l = 1/2. WMTSA: 411–412 III 48

Conditional Mean and Median Approach: IX. Simplifying to the sparse signal model, Godfrey & Rocca (1981) show that U_1(O_l) ≈ 0 if |O_l| ≤ δ, and U_1(O_l) ≈ b_l O_l otherwise, where δ = σ_{n_l} (2 log( p_l σ_{G_l}/[(1 − p_l) σ_{n_l}] ))^{1/2} and b_l = σ²_{G_l}/(σ²_{G_l} + σ²_{n_l}); the above approximation is valid if p_l/(1 − p_l) ≫ σ²_{n_l}/(σ_{G_l} δ) and σ²_{G_l} ≫ σ²_{n_l}. Note that U_1(·) is approximately a hard thresholding rule. WMTSA: 411–412 III 49

Wavelet-Based Thresholding. Assume the model of a deterministic signal plus IID Gaussian noise with mean 0 and variance σ²: X = D + ε. Using a DWT matrix 𝒲, form W = 𝒲X = 𝒲D + 𝒲ε ≡ d + e; because ε is IID Gaussian, so is e. Donoho & Johnstone (1994) advocate the following: form a partial DWT of level J_0: W_1, ..., W_{J_0} and V_{J_0}; threshold the W_j's but leave V_{J_0} alone (i.e., administratively, all N/2^{J_0} scaling coefficients are assumed to be part of d); use the universal threshold δ^(u) = [2σ² log(N)]^{1/2}; use a thresholding rule to form W^(t)_j (hard, etc.); estimate D by inverse transforming W^(t)_1, ..., W^(t)_{J_0} and V_{J_0}. WMTSA: 417–419 III 50
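For readers who want to experiment with this recipe, here is a hedged sketch using the third-party PyWavelets package on an invented test series; it assumes that pywt's 'sym4' filter with periodized boundaries is an acceptable stand-in for the LA(8) DWT used in WMTSA (the boundary conventions differ), and it uses the MAD estimate of σ described on the next overhead.

import numpy as np
import pywt                                        # PyWavelets (assumed available)

rng = np.random.default_rng(4)
N, J0 = 1024, 6
t = np.arange(N)
D = np.sin(2 * np.pi * t / 256) + np.where((t > 300) & (t < 340), 3.0, 0.0)
X = D + rng.normal(scale=0.5, size=N)              # signal plus IID Gaussian noise

coeffs = pywt.wavedec(X, 'sym4', mode='periodization', level=J0)
V_J0, W = coeffs[0], coeffs[1:]                    # scaling and wavelet coefficients

sigma_hat = np.median(np.abs(W[-1])) / 0.6745      # MAD estimate from the level-1 coefficients
delta_u = sigma_hat * np.sqrt(2 * np.log(N))       # universal threshold
W_thr = [pywt.threshold(w, delta_u, mode='hard') for w in W]

D_hat = pywt.waverec([V_J0] + W_thr, 'sym4', mode='periodization')
print(np.mean((D_hat - D) ** 2), np.mean((X - D) ** 2))   # denoised vs raw mean square error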

MAD Scale Estimator: I. The above procedure assumes that σ is known, which is not usually the case. If it is unknown, use the median absolute deviation (MAD) scale estimator based on W_1 to estimate σ: σ̂_(mad) ≡ median{|W_{1,0}|, |W_{1,1}|, ..., |W_{1,N/2−1}|}/0.6745. Heuristic: the bulk of the W_{1,t}'s should be due to noise; the factor 0.6745 yields an estimator such that E{σ̂_(mad)} = σ when the W_{1,t}'s are IID Gaussian with mean 0 and variance σ²; the estimator is designed to be robust against large W_{1,t}'s due to the signal. WMTSA: 420 III 51

MAD Scale Estimator: II. Example: suppose W_1 has 7 small noise coefficients & 2 large signal coefficients (say, a & b, with 2 ≪ |a| < |b|): W_1 = [1.23, 1.72, 0.8, 0.1, a, 0.3, 0.67, b, 1.33]^T. Ordering these by magnitude yields 0.1, 0.3, 0.67, 0.8, 1.23, 1.33, 1.72, |a|, |b|; the median of these absolute deviations is 1.23, so σ̂_(mad) = 1.23/0.6745 ≈ 1.82. σ̂_(mad) is not influenced adversely by a and b; i.e., the scale estimate depends largely on the many small coefficients due to noise. WMTSA: 420 III 52
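The example is easy to verify numerically; in this sketch the two large signal coefficients are given the arbitrary values a = 10 and b = 20.

import numpy as np

W1 = np.array([1.23, 1.72, 0.8, 0.1, 10.0, 0.3, 0.67, 20.0, 1.33])
mad = np.median(np.abs(W1))
print(mad, mad / 0.6745)        # 1.23 and roughly 1.82, whatever the values of a and b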

Examples of DWT-Based Thresholding: I. Top plot of the figure: the NMR spectrum X; middle: the signal estimate using a level J_0 = 6 partial LA(8) DWT with hard thresholding and the universal threshold level estimated by δ̂^(u) = [2σ̂²_(mad) log(N)]^{1/2} ≈ 6.13; bottom: same, but now using the D(4) DWT, with δ̂^(u) ≈ 6.49. WMTSA: 418 III 53

Examples of DWT-Based Thresholding: II. Top plot of the figure: the signal estimate using a level J_0 = 6 partial LA(8) DWT with hard thresholding (a repeat of the middle plot of the previous overhead); middle: same, but now with soft thresholding; bottom: same, but now with mid thresholding. WMTSA: 419 III 54

Examples of MODWT-Based Thresholding. As in the previous overhead, but using the MODWT rather than the DWT. Because the MODWT filters are normalized differently, the universal threshold must be adjusted for each level: δ̃^(u)_j ≡ [2σ̃²_(mad) log(N)/2^j]^{1/2} ≈ 6.5/2^{j/2}. The results are identical to what cycle spinning would yield. WMTSA: 429–430 III 55

VisuShrink. The Donoho & Johnstone (1994) recipe with soft thresholding is known as VisuShrink (though it is really thresholding, not shrinkage). Rather than using the universal threshold, we can also determine δ for VisuShrink by finding the value δ̂^(S) that minimizes SURE, i.e., minimizes Σ_{j=1}^{J_0} Σ_{t=0}^{N_j−1} (2σ̂²_(mad) − W²_{j,t} + δ²) 1_{[δ²,∞)}(W²_{j,t}) as a function of δ, with σ² estimated via the MAD estimator. WMTSA: 420–421 III 56

Examples of DWT-Based Thresholding: III. Top plot of the figure: the VisuShrink estimate based upon a level J_0 = 6 partial LA(8) DWT and SURE (δ̂^(S) ≈ 2.19), with the MAD estimate based upon W_1; bottom: same, but now with the MAD estimate based upon W_1, W_2, ..., W_6 (δ̂^(S) ≈ 3.2); the variance in SURE is assumed common to all wavelet coefficients. The resulting signal estimate in the bottom plot is less noisy than in the top plot. WMTSA: 420–421 III 57

Wavelet-Based Shrinkage: I. Assume the model of a stochastic signal plus IID Gaussian noise: X = C + ε, so that W = 𝒲X = 𝒲C + 𝒲ε ≡ R + e; component-wise, we have W_{j,t} = R_{j,t} + e_{j,t}. Form a partial DWT of level J_0, shrink the W_j's, but leave V_{J_0} alone; assume E{R_{j,t}} = 0 (reasonable for W_j, but not for V_{J_0}). Use a conditional mean approach with the sparse signal model: R_{j,t} has the distribution dictated by (1 − I_{j,t}) N(0, σ²_G), where P[I_{j,t} = 1] = p and P[I_{j,t} = 0] = 1 − p; the R_{j,t}'s are assumed to be IID; the model for e_{j,t} is Gaussian with mean 0 and variance σ²_ε. Note: the parameters do not vary with j or t. WMTSA: 424 III 58

Wavelet-Based Shrinkage: II. The model has three parameters, σ²_G, p and σ²_ε, which need to be set. Let σ²_R and σ²_W be the variances of the RVs R_{j,t} and W_{j,t}; we have the relationships σ²_R = (1 − p)σ²_G and σ²_W = σ²_R + σ²_ε. Set σ̂²_ε = σ̂²_(mad) using W_1; let σ̂²_W be the sample mean of all the W²_{j,t}; given p, let σ̂²_G = (σ̂²_W − σ̂²_ε)/(1 − p). p is usually chosen subjectively, keeping in mind that p is the proportion of noise-dominated coefficients (it can be set based on a rough estimate of the proportion of small coefficients). WMTSA: 424–426 III 59
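A minimal sketch of these moment-style parameter estimates is given below; wavelet_coeffs is assumed to be a list of wavelet-coefficient arrays [W_1, ..., W_J0] with the level-1 (finest) coefficients first, and p = 0.95 is an arbitrary choice.

import numpy as np

def shrinkage_parameters(wavelet_coeffs, p):
    # wavelet_coeffs: list of arrays [W_1, ..., W_J0], finest level first
    W1 = wavelet_coeffs[0]
    sigma_eps2 = (np.median(np.abs(W1)) / 0.6745) ** 2       # noise variance via MAD on W_1
    sigma_W2 = np.mean(np.concatenate(wavelet_coeffs) ** 2)  # sample mean of all W_{j,t}^2
    sigma_G2 = (sigma_W2 - sigma_eps2) / (1.0 - p)           # implied signal variance, given p
    return sigma_eps2, sigma_W2, sigma_G2

rng = np.random.default_rng(5)
toy = [rng.normal(size=n) for n in (512, 256, 128)]          # synthetic coefficient arrays
toy[1][:10] += 8.0                                           # a few large "signal" values
print(shrinkage_parameters(toy, p=0.95))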

Examples of Wavelet-Based Shrinkage. Shrinkage signal estimates of the NMR spectrum based upon the level J_0 = 6 partial LA(8) DWT and the conditional mean with p = 0.9 (top plot), 0.95 (middle) and 0.99 (bottom). As p → 1, we declare that there are proportionately fewer significant signal coefficients, implying the need for heavier shrinkage. WMTSA: 425 III 60

Comments on Next Generation Denoising: I. [figure: an ECG time series X together with circularly shifted MODWT wavelet and scaling coefficient vectors for levels 1 to 6] Classical denoising looks at each W_{j,t} alone; for real-world signals, coefficients often cluster within a given level and persist across adjacent levels (the ECG series offers an example). WMTSA: 450 III 61

Comments on Next Generation Denoising: II. Here are some 'next generation' approaches that exploit these real-world properties: Crouse et al. (1998) use hidden Markov models for the stochastic signal DWT coefficients to handle clustering, persistence and non-Gaussianity; Huang and Cressie (2000) consider scale-dependent multiscale graphical models to handle clustering and persistence; Cai and Silverman (2001) consider block thresholding, in which coefficients are thresholded in blocks rather than individually (handles clustering); Dragotti and Vetterli (2003) introduce the notion of wavelet footprints to track discontinuities in a signal across different scales (handles persistence). WMTSA: 450–452 III 62

Comments on Next Generation Denoising: III. Classical denoising also suffers from the problem of the overall significance of multiple hypothesis tests; next generation work integrates the idea of the false discovery rate (Benjamini and Hochberg, 1995) into denoising (see Wink and Roerdink, 2004, for an applications-oriented discussion). For more recent developments (there are a lot!), see the review article by Antoniadis (2007), Chapters 3 and 4 of the book by Nason (2008), and the October 2009 issue of Statistica Sinica, which has a special section entitled 'Multiscale Methods and Statistics: A Productive Marriage'. III 63

References
A. Antoniadis (2007), 'Wavelet Methods in Statistics: Some Recent Developments and Their Applications,' Statistical Surveys, 1, pp. 16–55.
Y. Benjamini and Y. Hochberg (1995), 'Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,' Journal of the Royal Statistical Society, Series B, 57, pp. 289–300.
T. Cai and B. W. Silverman (2001), 'Incorporating Information on Neighboring Coefficients into Wavelet Estimation,' Sankhya, Series B, 63, pp. 127–148.
M. S. Crouse, R. D. Nowak and R. G. Baraniuk (1998), 'Wavelet-Based Statistical Signal Processing Using Hidden Markov Models,' IEEE Transactions on Signal Processing, 46, pp. 886–902.
P. L. Dragotti and M. Vetterli (2003), 'Wavelet Footprints: Theory, Algorithms, and Applications,' IEEE Transactions on Signal Processing, 51, pp. 1306–1323.
H. C. Huang and N. Cressie (2000), 'Deterministic/Stochastic Wavelet Decomposition for Recovery of Signal from Noisy Data,' Technometrics, 42, pp. 262–276.
G. P. Nason (2008), Wavelet Methods in Statistics with R, Springer, Berlin.
III 64

A. M. Wink and J. B. T. M. Roerdink (2004), 'Denoising Functional MR Images: A Comparison of Wavelet Denoising and Gaussian Smoothing,' IEEE Transactions on Medical Imaging, 23(3), pp. 374–387. III 65