Wavelet Methods for Time Series Analysis. Part II: Wavelet-Based Statistical Analysis of Time Series


Wavelet Methods for Time Series Analysis
Part II: Wavelet-Based Statistical Analysis of Time Series

topics to be covered:
- wavelet variance (analysis phase of MODWT)
- wavelet-based signal extraction (synthesis phase of DWT)
- wavelet-based decorrelation of time series (analysis phase of DWT, but synthesis phase plays a role also)

Wavelet Variance: Overview
- review of decomposition of sample variance using wavelets
- theoretical wavelet variance for stochastic processes
  - stationary processes
  - nonstationary processes with stationary differences
- sampling theory for Gaussian processes
- real-world examples
- extensions and summary

Decomposing Sample Variance of Time Series
- let X_0, X_1, ..., X_{N−1} represent a time series with N values
- let X̄ denote the sample mean of the X_t's: X̄ ≡ (1/N) Σ_{t=0}^{N−1} X_t
- let σ̂²_X denote the sample variance of the X_t's: σ̂²_X ≡ (1/N) Σ_{t=0}^{N−1} (X_t − X̄)²
- idea is to decompose (analyze, break up) σ̂²_X into pieces that quantify how one time series might differ from another
- wavelet variance does an analysis based upon differences between (possibly weighted) adjacent averages over scales

Empirical Wavelet Variance
- define the empirical wavelet variance for scale τ_j ≡ 2^{j−1} as
  ν̃²_X(τ_j) ≡ (1/N) Σ_{t=0}^{N−1} W̃²_{j,t}, where W̃_{j,t} ≡ Σ_{l=0}^{L_j−1} h̃_{j,l} X_{t−l mod N}
- if N = 2^J, obtain an analysis (decomposition) of the sample variance:
  σ̂²_X = (1/N) Σ_{t=0}^{N−1} (X_t − X̄)² = Σ_{j=1}^{J} ν̃²_X(τ_j)
  (if N is not a power of 2, can analyze the variance to any level J₀, but need an additional component involving the scaling coefficients)
- interpretation: ν̃²_X(τ_j) is the portion of σ̂²_X due to changes in averages over scale τ_j; i.e., a scale-by-scale analysis of variance
WMTSA: 298
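The decomposition above can be verified numerically. Here is a minimal brute-force sketch (stdlib Python; the function name is mine, and only the Haar filter is implemented, whose level-j equivalent MODWT filter is +1/2^j at lags 0..τ_j−1 and −1/2^j at lags τ_j..2τ_j−1) confirming that the empirical wavelet variances sum to the sample variance when N is a power of two:

```python
import math, random

def haar_modwt_wvar(x):
    """Empirical Haar MODWT wavelet variances at scales tau_j = 2**(j-1),
    j = 1..J, computed by direct circular filtering with the level-j
    equivalent filter (+1/2**j then -1/2**j taps)."""
    n = len(x)
    J = int(math.log2(n))                  # assumes n is a power of two
    nu = []
    for j in range(1, J + 1):
        tau = 2 ** (j - 1)
        w = []
        for t in range(n):
            # weighted sums implementing the +/- 1/2**j filter taps
            s1 = sum(x[(t - l) % n] for l in range(tau)) / (2 * tau)
            s2 = sum(x[(t - l) % n] for l in range(tau, 2 * tau)) / (2 * tau)
            w.append(s1 - s2)
        nu.append(sum(wt * wt for wt in w) / n)
    return nu

random.seed(0)
x = [random.gauss(0, 1) for _ in range(64)]
xbar = sum(x) / len(x)
svar = sum((xt - xbar) ** 2 for xt in x) / len(x)
nu = haar_modwt_wvar(x)
# scale-by-scale decomposition: wavelet variances sum to the sample variance
assert abs(sum(nu) - svar) < 1e-10
```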

Example of Empirical Wavelet Variance
[figure: two time series {X_t} and {Y_t} of length N = 16, each with zero sample mean and the same sample variance, together with their wavelet variances ν̃²_X(τ_j) and ν̃²_Y(τ_j) plotted versus scales τ_j = 1, 2, 4, 8]
WMTSA: 298

Theoretical Wavelet Variance: I
- now assume X_t is a real-valued random variable (RV)
- let {X_t, t ∈ Z} denote a stochastic process, i.e., a collection of RVs indexed by time t (here Z denotes the set of all integers)
- apply the jth level equivalent MODWT filter {h̃_{j,l}} to {X_t} to create a new stochastic process:
  W_{j,t} ≡ Σ_{l=0}^{L_j−1} h̃_{j,l} X_{t−l}, t ∈ Z,
  which should be contrasted with
  W̃_{j,t} ≡ Σ_{l=0}^{L_j−1} h̃_{j,l} X_{t−l mod N}, t = 0, 1, ..., N−1
WMTSA: 295–296

Theoretical Wavelet Variance: II
- if Y is any RV, let E{Y} denote its expectation
- let var{Y} denote its variance: var{Y} ≡ E{(Y − E{Y})²}
- definition of the time-dependent wavelet variance:
  ν²_{X,t}(τ_j) ≡ var{W_{j,t}}, with conditions on X_t so that var{W_{j,t}} exists and is finite
- in general ν²_{X,t}(τ_j) depends on both τ_j and t
- will focus on the time-independent wavelet variance ν²_X(τ_j) ≡ var{W_{j,t}} (can adapt the theory to handle the time-varying situation)
- ν²_X(τ_j) is well-defined for stationary processes and certain related processes, so let's review the concept of stationarity
WMTSA: 295–296

Definition of a Stationary Process
- if U and V are two RVs, denote their covariance by cov{U, V} ≡ E{(U − E{U})(V − E{V})}
- a stochastic process {X_t} is called stationary if
  - E{X_t} = µ_X for all t, i.e., a constant independent of t
  - cov{X_t, X_{t+τ}} = s_{X,τ}, i.e., depends on the lag τ, but not t
- {s_{X,τ}, τ ∈ Z} is the autocovariance sequence (ACVS)
- s_{X,0} = cov{X_t, X_t} = var{X_t}; i.e., the variance is the same for all t
WMTSA: 266

Wavelet Variance for Stationary Processes
- for stationary processes, the wavelet variance decomposes var{X_t}:
  Σ_{j=1}^{∞} ν²_X(τ_j) = var{X_t},
  which is similar to Σ_{j=1}^{J} ν̃²_X(τ_j) = σ̂²_X
- ν²_X(τ_j) is thus the contribution to var{X_t} due to scale τ_j
- note: ν²_X(τ_j) and X²_t have the same units (can be important for interpretability)
WMTSA: 296–297

White Noise Process
- the simplest example of a stationary process is a white noise process
- X_t is said to be white noise if
  - it has a constant mean E{X_t} = µ_X
  - it has a constant variance var{X_t} = σ²_X
  - cov{X_t, X_{t+τ}} = 0 for all t and nonzero τ; i.e., distinct RVs in the process are uncorrelated
- the ACVS for white noise takes a very simple form:
  s_{X,τ} = cov{X_t, X_{t+τ}} = σ²_X if τ = 0, and 0 otherwise
WMTSA: 268

Wavelet Variance for White Noise Process: I
- for a white noise process, can show that ν²_X(τ_j) = var{X_t}/2^j ∝ 1/τ_j since τ_j = 2^{j−1}
- note that Σ_{j=1}^{∞} ν²_X(τ_j) = var{X_t} (1/2 + 1/4 + 1/8 + ···) = var{X_t}, as required
- note also that log(ν²_X(τ_j)) ∝ −log(τ_j), so a plot of log(ν²_X(τ_j)) vs. log(τ_j) is linear with a slope of −1
WMTSA: 296–297, 337
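A one-line check of the 1/2^j rule (my own derivation, not from the slides, using two facts: variances of uncorrelated RVs add, and the MODWT equivalent filters are normalized so that Σ_l h̃²_{j,l} = 1/2^j):

```latex
\nu^2_X(\tau_j)
= \operatorname{var}\Bigl\{\sum_{l=0}^{L_j-1} \tilde h_{j,l}\, X_{t-l}\Bigr\}
= \sigma^2_X \sum_{l=0}^{L_j-1} \tilde h_{j,l}^{\,2}
= \frac{\sigma^2_X}{2^j}.
```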

Wavelet Variance for White Noise Process: II
[figure: ν²_X(τ_j) versus τ_j for j = 1, ..., 8 on a log/log scale with a line of slope −1 through the points, along with a sample of length N = 256 of Gaussian white noise]
- the largest contribution to var{X_t} is at the smallest scale τ_1
- note: later on, we will discuss fractionally differenced (FD) processes that are characterized by a parameter δ; when δ = 0, an FD process is the same as a white noise process
WMTSA: 296–297, 337

Generalization to Certain Nonstationary Processes
- if the wavelet filter is properly chosen, ν²_X(τ_j) is well-defined for certain processes with stationary backward differences (increments); these are also known as intrinsically stationary processes
- the first order backward difference of X_t is the process defined by X^(1)_t = X_t − X_{t−1}
- the second order backward difference of X_t is the process defined by X^(2)_t = X^(1)_t − X^(1)_{t−1} = X_t − 2X_{t−1} + X_{t−2}
- X_t is said to have dth order stationary backward differences if
  Y_t ≡ Σ_{k=0}^{d} (d choose k) (−1)^k X_{t−k}
  forms a stationary process (d is a nonnegative integer)
WMTSA: 287–289
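To make the backward-difference definitions concrete, here is a small sketch (stdlib Python; names are mine) that builds a "random run" (twice-cumulated white noise, so d = 2) and checks that differencing it twice recovers the underlying white noise:

```python
import random

def bdiff(x):
    """First-order backward difference: x[t] - x[t-1] (length shrinks by one)."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

random.seed(1)
eps = [random.gauss(0, 1) for _ in range(1000)]   # zero-mean white noise
walk = []                                         # random walk: cumulative sum of eps
s = 0.0
for e in eps:
    s += e
    walk.append(s)
run = []                                          # random run: cumulative sum of the walk
s = 0.0
for w in walk:
    s += w
    run.append(s)

# d = 2 stationary increments: differencing the run twice gives back eps
d2 = bdiff(bdiff(run))
assert all(abs(a - b) < 1e-9 for a, b in zip(d2, eps[2:]))
```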

Examples of Processes with Stationary Increments
[figure: three rows of plots over t = 0, ..., 256 showing realizations X_t (1st column) and their differences]
- 1st column shows, from top to bottom, realizations from
  (a) random walk: X_t = Σ_{u=1}^{t} ε_u, where ε_t is zero mean white noise
  (b) like (a), but now ε_t has a mean of 0.2
  (c) random run: X_t = Σ_{u=1}^{t} Y_u, where Y_t is a random walk
- 2nd & 3rd columns show the 1st & 2nd differences X^(1)_t and X^(2)_t
WMTSA: 287–289

Wavelet Variance for Processes with Stationary Backward Differences: I
- let {X_t} be nonstationary with dth order stationary differences
- if we use a Daubechies wavelet filter of width L satisfying L ≥ 2d, then ν²_X(τ_j) is well-defined and finite for all τ_j, but now
  Σ_{j=1}^{∞} ν²_X(τ_j) = ∞
- this works because there is a backward difference operator of order L/2 ≥ d embedded within {h̃_{j,l}}, so this filter reduces X_t to
  Σ_{k=0}^{d} (d choose k) (−1)^k X_{t−k} = Y_t
  and then creates localized weighted averages of the Y_t's
WMTSA: 305

Wavelet Variance for Random Walk Process: I
- the random walk process X_t = Σ_{u=1}^{t} ε_u has first order (d = 1) stationary differences since X_t − X_{t−1} = ε_t (i.e., white noise)
- L ≥ 2d holds for all wavelets when d = 1; for the Haar wavelet (L = 2),
  ν²_X(τ_j) = var{ε_t} τ_j/6 + var{ε_t}/(12 τ_j) ≈ var{ε_t} τ_j/6,
  with the approximation becoming better as τ_j increases
- note that ν²_X(τ_j) increases as τ_j increases
- log(ν²_X(τ_j)) ∝ log(τ_j) approximately, so a plot of log(ν²_X(τ_j)) vs. log(τ_j) is approximately linear with a slope of +1
- as required, also have
  Σ_{j=1}^{∞} ν²_X(τ_j) = (var{ε_t}/6) [(1 + 1/2) + (2 + 1/4) + (4 + 1/8) + ···] = ∞
WMTSA: 337
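A quick consistency check at unit scale (my own verification, consistent with the Haar formula above): the level-1 Haar MODWT coefficient of a random walk is (X_t − X_{t−1})/2 = ε_t/2, so

```latex
\nu^2_X(\tau_1) = \operatorname{var}\{\varepsilon_t/2\}
= \frac{\operatorname{var}\{\varepsilon_t\}}{4}
= \frac{\operatorname{var}\{\varepsilon_t\}}{6}\Bigl(\tau_1 + \frac{1}{2\tau_1}\Bigr)\bigg|_{\tau_1 = 1},
```

since (1/6)(1 + 1/2) = 1/4, matching the exact formula at τ_1 = 1.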

Wavelet Variance for Random Walk Process: II
[figure: ν²_X(τ_j) versus τ_j for j = 1, ..., 8 on a log/log scale with a line of slope +1 through the points, along with a sample of length N = 256 of a Gaussian random walk process]
- the smallest contribution to var{X_t} is at the smallest scale τ_1
- note: a fractionally differenced process with parameter δ = 1 is the same as a random walk process
WMTSA: 337

Fractionally Differenced (FD) Processes: I
- can create a continuum of processes that interpolate between white noise and random walks and extrapolate beyond them using the notion of fractional differencing (Granger and Joyeux, 1980; Hosking, 1981)
- an FD(δ) process is determined by 2 parameters δ and σ²_ε, where −∞ < δ < ∞ and σ²_ε > 0 (σ²_ε is less important than δ)
- if δ < 1/2, the FD process {X_t} is stationary, and, in particular,
  - reduces to white noise if δ = 0
  - has long memory or long range dependence if δ > 0
  - is antipersistent if δ < 0 (i.e., cov{X_t, X_{t+1}} < 0)
WMTSA: 281–285

Fractionally Differenced (FD) Processes: II
- if δ ≥ 1/2, the FD process {X_t} is nonstationary with dth order stationary backward differences {Y_t}
  - here d = ⌊δ + 1/2⌋, where ⌊x⌋ is the integer part of x
  - {Y_t} is a stationary FD(δ − d) process
- if δ = 1, the FD process is the same as a random walk process
- except possibly for the two or three smallest scales, have ν²_X(τ_j) ≈ C τ_j^{2δ−1}
- thus log(ν²_X(τ_j)) ≈ log(C) + (2δ − 1) log(τ_j), so a log/log plot of ν²_X(τ_j) vs. τ_j looks approximately linear with slope 2δ − 1 for τ_j large enough
WMTSA: 287–288, 297

LA(8) Wavelet Variance for 2 FD Processes
[figure: log/log plots of ν²_X(τ_j) versus τ_j for FD processes with δ = 1/4 and δ = 1/2]
- see the earlier white noise overhead for δ = 0, which has slope −1
- δ = 1/4 has slope −1/2
- δ = 1/2 has slope 0 (related to so-called pink noise)
WMTSA: 287–288, 297

LA(8) Wavelet Variance for 2 More FD Processes
[figure: log/log plots of ν²_X(τ_j) versus τ_j for FD processes with δ = 5/6 and δ = 1]
- δ = 5/6 has slope 2/3 (related to Kolmogorov turbulence)
- δ = 1 has slope +1 (random walk)
- nonnegative slopes indicate nonstationarity, while negative slopes indicate stationarity
WMTSA: 287–288, 297

Wavelet Variance for Processes with Stationary Backward Differences: II
- summary: ν²_X(τ_j) is well-defined for a process {X_t} that is either
  - stationary, or
  - nonstationary with dth order stationary increments, but then the width of the wavelet filter must satisfy L ≥ 2d
- if {X_t} is stationary, then
  Σ_{j=1}^{∞} ν²_X(τ_j) = var{X_t} < ∞
  (recall that each RV in a stationary process must have the same finite variance)
WMTSA: 299–301, 305

Wavelet Variance for Processes with Stationary Backward Differences: III
- if {X_t} is nonstationary, then Σ_{j=1}^{∞} ν²_X(τ_j) = ∞
- with a suitable construction, we can take the variance of a nonstationary process with dth order stationary increments to be ∞
- using this construction, we have
  Σ_{j=1}^{∞} ν²_X(τ_j) = var{X_t}
  for both the stationary and nonstationary cases
WMTSA: 299–301, 305

Background on Gaussian Random Variables
- N(µ, σ²) denotes a Gaussian (normal) RV with mean µ and variance σ²
- will write X =ᵈ N(µ, σ²) to mean the RV X has the same distribution as that Gaussian RV
- the RV N(0, 1) is often written as Z (called standard Gaussian or standard normal)
- let Φ(·) be the Gaussian cumulative distribution function:
  Φ(z) ≡ P[Z ≤ z] = ∫_{−∞}^{z} (2π)^{−1/2} e^{−x²/2} dx
- the inverse Φ^{−1}(·) of Φ(·) is such that P[Z ≤ Φ^{−1}(p)] = p
- Φ^{−1}(p) is called the p × 100% percentage point
WMTSA: 256–257

Background on Chi-Square Random Variables
- X is said to be a chi-square RV with η degrees of freedom if its probability density function (PDF) is given by
  f_X(x; η) = x^{(η/2)−1} e^{−x/2} / (2^{η/2} Γ(η/2)), x ≥ 0, η > 0
- χ²_η denotes an RV with the above PDF
- two important facts: E{χ²_η} = η and var{χ²_η} = 2η
- let Q_η(p) denote the p × 100% percentage point for the RV χ²_η:
  P[χ²_η ≤ Q_η(p)] = p
WMTSA: 263–264

Expected Value of Wavelet Coefficients
- in preparation for considering the problem of estimating ν²_X(τ_j) given an observed time series, need to consider E{W_{j,t}}
- if {X_t} is nonstationary but has dth order stationary increments, let {Y_t} be the stationary process obtained by differencing {X_t} d times; if {X_t} is stationary (the d = 0 case), let Y_t = X_t
- with µ_Y ≡ E{Y_t}, have E{W_{j,t}} = 0 if either (i) L > 2d or (ii) L = 2d and µ_Y = 0
- E{W_{j,t}} ≠ 0 if µ_Y ≠ 0 and L = 2d
- thus have E{W_{j,t}} = 0 if L is picked large enough (L > 2d is sufficient, but might not be necessary)
- knowing E{W_{j,t}} = 0 eases the job of estimating ν²_X(τ_j) considerably
WMTSA: 304–305

Unbiased Estimator of Wavelet Variance: I
- given a realization of X_0, X_1, ..., X_{N−1} from a process with dth order stationary differences, want to estimate ν²_X(τ_j)
- for a wavelet filter such that L ≥ 2d and E{W_{j,t}} = 0, have
  ν²_X(τ_j) = var{W_{j,t}} = E{W²_{j,t}}
- can base an estimator on the squares of the W̃_{j,t}'s
- recall that
  W̃_{j,t} ≡ Σ_{l=0}^{L_j−1} h̃_{j,l} X_{t−l mod N}, t = 0, 1, ..., N−1, while
  W_{j,t} ≡ Σ_{l=0}^{L_j−1} h̃_{j,l} X_{t−l}, t ∈ Z
WMTSA: 306

Unbiased Estimator of Wavelet Variance: II
- comparing W̃_{j,t} = Σ_{l=0}^{L_j−1} h̃_{j,l} X_{t−l mod N} with W_{j,t} ≡ Σ_{l=0}^{L_j−1} h̃_{j,l} X_{t−l} says that W̃_{j,t} = W_{j,t} if the 'mod N' is not needed; this happens when L_j − 1 ≤ t < N (recall that L_j = (2^j − 1)(L − 1) + 1)
- if N ≥ L_j, an unbiased estimator of ν²_X(τ_j) is
  ν̂²_X(τ_j) ≡ (1/M_j) Σ_{t=L_j−1}^{N−1} W̃²_{j,t} = (1/M_j) Σ_{t=L_j−1}^{N−1} W²_{j,t},
  where M_j ≡ N − L_j + 1
WMTSA: 306
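The unbiased estimator can be sketched as follows (stdlib Python, Haar filter only, function name mine): average the squared coefficients for t = L_j − 1, ..., N − 1, where circularity never kicks in, and divide by M_j rather than N.

```python
def haar_unbiased_wvar(x, j):
    """Unbiased Haar wavelet variance estimate at scale tau_j = 2**(j-1):
    average of squared MODWT coefficients unaffected by circularity."""
    n = len(x)
    tau = 2 ** (j - 1)
    Lj = 2 * tau                       # Haar (L = 2) equivalent filter width
    Mj = n - Lj + 1                    # number of boundary-free coefficients
    assert Mj >= 1, "series too short for this scale"
    total = 0.0
    for t in range(Lj - 1, n):         # t = Lj-1, ..., n-1: no 'mod N' needed
        w = (sum(x[t - l] for l in range(tau))
             - sum(x[t - l] for l in range(tau, 2 * tau))) / (2 * tau)
        total += w * w
    return total / Mj

# hand-checkable example: X = [1, 0, 0, 0], j = 1 gives interior
# coefficients -1/2, 0, 0 and M_1 = 3, so the estimate is 1/12
assert abs(haar_unbiased_wvar([1.0, 0.0, 0.0, 0.0], 1) - 1.0 / 12.0) < 1e-12
```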

Statistical Properties of ν̂²_X(τ_j)
- assume that {W_{j,t}} is a Gaussian stationary process with mean zero and ACVS {s_{j,τ}}
- suppose {s_{j,τ}} is such that
  A_j ≡ Σ_{τ=−∞}^{∞} s²_{j,τ} < ∞
  (if A_j = ∞, can usually make it finite just by increasing L)
- can show that ν̂²_X(τ_j) is asymptotically Gaussian with mean ν²_X(τ_j) and large sample variance 2A_j/M_j; i.e.,
  (ν̂²_X(τ_j) − ν²_X(τ_j)) / (2A_j/M_j)^{1/2} = M_j^{1/2} (ν̂²_X(τ_j) − ν²_X(τ_j)) / (2A_j)^{1/2} =ᵈ N(0, 1)
  approximately for large M_j ≡ N − L_j + 1
WMTSA: 307

Estimation of A_j
- in practical applications, need to estimate A_j = Σ_τ s²_{j,τ}
- can argue that, for large M_j, the estimator
  Â_j ≡ (ŝ^(p)_{j,0})²/2 + Σ_{τ=1}^{M_j−1} (ŝ^(p)_{j,τ})²
  is approximately unbiased, where
  ŝ^(p)_{j,τ} ≡ (1/M_j) Σ_{t=L_j−1}^{N−1−|τ|} W̃_{j,t} W̃_{j,t+|τ|}, 0 ≤ |τ| ≤ M_j − 1
- Monte Carlo results: Â_j is reasonably good for M_j ≥ 128
WMTSA: 312
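A direct transcription of the Â_j formula (stdlib Python; function name mine, taking the boundary-free coefficients as input):

```python
def a_hat(w):
    """Estimate A_j from boundary-free wavelet coefficients w:
    A-hat = s_0**2 / 2 + sum over tau >= 1 of s_tau**2, where s_tau is the
    biased ACVS estimate (1/M_j) * sum_t w[t] * w[t + tau]."""
    m = len(w)
    s = [sum(w[t] * w[t + tau] for t in range(m - tau)) / m
         for tau in range(m)]
    return s[0] ** 2 / 2 + sum(sv ** 2 for sv in s[1:])

# hand-checkable: w = [1, -1, 2] gives s = [2, -1, 2/3],
# so A-hat = 4/2 + 1 + 4/9 = 31/9
assert abs(a_hat([1.0, -1.0, 2.0]) - 31.0 / 9.0) < 1e-12
```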

Confidence Intervals for ν²_X(τ_j): I
- based upon large sample theory, can form a (1 − 2p) × 100% confidence interval (CI) for ν²_X(τ_j):
  [ν̂²_X(τ_j) − Φ^{−1}(1 − p) √(2A_j)/√(M_j), ν̂²_X(τ_j) + Φ^{−1}(1 − p) √(2A_j)/√(M_j)];
  i.e., the random interval traps the unknown ν²_X(τ_j) with probability 1 − 2p
- if A_j is replaced by Â_j, get an approximate (1 − 2p) × 100% CI
- critique: the lower limit of the CI can very well be negative even though ν²_X(τ_j) ≥ 0 always
- can avoid this problem by using a χ² approximation
WMTSA: 310

Confidence Intervals for ν²_X(τ_j): II
- χ²_η is useful for approximating the distribution of a sum of squared Gaussian RVs, which is what we are dealing with here:
  ν̂²_X(τ_j) = (1/M_j) Σ_{t=L_j−1}^{N−1} W²_{j,t}
- idea is to assume ν̂²_X(τ_j) =ᵈ a χ²_η, where a and η are constants to be set via moment matching
- because E{χ²_η} = η and var{χ²_η} = 2η, we have E{a χ²_η} = aη and var{a χ²_η} = 2a²η
- can equate E{ν̂²_X(τ_j)} & var{ν̂²_X(τ_j)} to aη & 2a²η to determine a & η
WMTSA: 313

Confidence Intervals for ν²_X(τ_j): III
- obtain
  η = 2 (E{ν̂²_X(τ_j)})² / var{ν̂²_X(τ_j)} = 2ν⁴_X(τ_j) / var{ν̂²_X(τ_j)} and a = ν²_X(τ_j)/η
- after η has been determined, can obtain a CI for ν²_X(τ_j): with probability 1 − 2p, the random interval
  [η ν̂²_X(τ_j) / Q_η(1 − p), η ν̂²_X(τ_j) / Q_η(p)]
  traps the true unknown ν²_X(τ_j)
- the lower limit is now nonnegative
- as N → ∞, the above CI and the Gaussian-based CI converge
WMTSA: 313
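The chi-square CI is easy to compute once percentage points Q_η(p) are available. The sketch below (stdlib Python; function names mine) substitutes the Wilson-Hilferty approximation for exact chi-square quantiles, using `statistics.NormalDist` (Python ≥ 3.8) for Φ^{−1}:

```python
from statistics import NormalDist

def chi2_quantile(eta, p):
    """Approximate Q_eta(p) via the Wilson-Hilferty transformation
    (a stand-in for exact chi-square percentage points)."""
    z = NormalDist().inv_cdf(p)
    return eta * (1 - 2 / (9 * eta) + z * (2 / (9 * eta)) ** 0.5) ** 3

def wvar_ci(nu_hat, eta, p=0.025):
    """Equi-tailed (1 - 2p)*100% chi-square CI for the wavelet variance:
    [eta*nu_hat/Q_eta(1-p), eta*nu_hat/Q_eta(p)]; lower limit is nonnegative."""
    return (eta * nu_hat / chi2_quantile(eta, 1 - p),
            eta * nu_hat / chi2_quantile(eta, p))

lo, hi = wvar_ci(1.0, 10.0)
assert 0 <= lo < 1.0 < hi      # the interval brackets the point estimate
```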

Three Ways to Set η
1. use large sample theory with appropriate estimates:
   η = 2ν⁴_X(τ_j) / var{ν̂²_X(τ_j)} ≈ 2ν⁴_X(τ_j) / (2A_j/M_j), which suggests η̂ = M_j ν̂⁴_X(τ_j) / Â_j
2. make an assumption about the effect of the wavelet filter on {X_t} to obtain the simple approximation η₃ = max{M_j/2^j, 1} (this effective but conservative approach is valuable if there are insufficient data to reliably estimate A_j)
3. the third way requires assuming the shape of the spectral density function associated with {X_t} (a questionable assumption, but common practice in, e.g., the atomic clock literature)
WMTSA: 313–315

Atomic Clock Deviates: I
[figure: atomic clock time errors {X_t} (top), their first backward differences {X^(1)_t} (middle) and second backward differences {X^(2)_t} (bottom), plotted against t in days]
WMTSA: 317–318

Atomic Clock Deviates: II
- top plot: errors {X_t} in time kept by atomic clock 571 (measured in microseconds: 1,000,000 microseconds = 1 second)
- middle: 1st backward differences {X^(1)_t}, in nanoseconds (1000 nanoseconds = 1 microsecond)
- bottom: 2nd backward differences {X^(2)_t}, also in nanoseconds
- if {X_t} is nonstationary with dth order stationary increments, need L ≥ 2d, but might need L > 2d to get E{W_{j,t}} = 0
- might regard {X^(1)_t} as a realization of a stationary process, but, if so, with a mean value far from 0; {X^(2)_t} resembles a realization of a stationary process, but its mean value still might not be 0 if we believe there is a linear trend in {X^(1)_t}; thus might need L ≥ 6, but could get away with L = 4
WMTSA: 317–318

Atomic Clock Deviates: III
[figure: square roots of wavelet variance estimates versus scale τ_j Δt in days, in two panels; different wavelet filters are marked by x's, circles and pluses, with 95% confidence intervals shown in the right-hand panel]
WMTSA: 319

Atomic Clock Deviates: IV
- square roots of wavelet variance estimates for the atomic clock time errors {X_t} based upon the unbiased MODWT estimator with
  - Haar wavelet (x's in left-hand plot, with linear fit)
  - D(4) wavelet (circles in left- and right-hand plots)
  - D(6) wavelet (pluses in left-hand plot)
- the Haar wavelet is inappropriate: it needs {X^(1)_t} to be a realization of a stationary process with mean 0 (stationarity might be OK, but the mean is way off); the linear appearance can be explained in terms of the nonzero mean
- the 95% confidence intervals in the right-hand plot are the square roots of intervals computed using the chi-square approximation, with η given by η̂ for j = 1, ..., 6 and by η₃ for j = 7 & 8
WMTSA: 319

Annual Minima of Nile River
[figure: left-hand plot: annual minima of the Nile River; right: Haar ν̂²_X(τ_j) versus scale in years before (x's) and after (o's) year 715.5, with 95% confidence intervals based upon the χ² approximation with η₃]
WMTSA: 326–327

Wavelet Variance Analysis of Time Series with Time-Varying Statistical Properties
- each wavelet coefficient W̃_{j,t} is formed using a portion of X_t
- suppose X_t is associated with actual time t₀ + t Δt
  - t₀ is the actual time of the first observation
  - Δt is the spacing between adjacent observations
- suppose {h_l} is a least asymmetric Daubechies wavelet filter; can then associate W̃_{j,t} with an interval of width 2τ_j Δt centered at t₀ + (2^j(t + 1) − 1 − ν^(H)_j mod N) Δt, where, e.g., ν^(H)_j = [7(2^j − 1) + 1]/2 for the LA(8) wavelet
- can thus form a localized wavelet variance analysis (implicitly assumes stationarity or stationary increments locally)
WMTSA: 114–115

Subtidal Sea Level Fluctuations: I
[figure: subtidal sea level fluctuations X for Crescent City, CA, plotted over 1980–1991]
- collected by the National Ocean Service with a permanent tidal gauge
- N = 8746 values from Jan 1980 to Dec 1991 (almost 12 years)
- one value every 12 hours, so Δt = 1/2 day
- 'subtidal' is what remains after the diurnal & semidiurnal tides are removed by a low-pass filter (the filter seriously distorts the frequency band corresponding to the first physical scale τ_1 Δt = 1/2 day)
WMTSA: 185–186

Subtidal Sea Level Fluctuations: II
[figure: level J₀ = 7 LA(8) MODWT multiresolution analysis of X over 1980–1991, showing the smooth and selected details]
WMTSA: 186

Subtidal Sea Level Fluctuations: III
[figure: estimated time-dependent wavelet variances plotted over 1980–1991]
- estimated time-dependent LA(8) wavelet variances for physical scale τ₂ Δt = 1 day, based upon averages over monthly blocks (30.5 days, i.e., 61 data points)
- the plot also shows a representative 95% confidence interval based upon a hypothetical wavelet variance estimate of 1/2 and a chi-square distribution with ν = 15.25
WMTSA: 324–326

Subtidal Sea Level Fluctuations: IV
[figure: estimated LA(8) wavelet variances for physical scales τ_j Δt = 2^{j−2} days, j = 2, ..., 7 (i.e., 1, 2, 4, 8, 16 and 32 days), grouped by calendar month J, F, M, ..., D]
WMTSA: 324–326

Some Extensions
- wavelet cross-covariance and cross-correlation (Whitcher, Guttorp and Percival, 2000; Serroukh and Walden, 2000a, 2000b)
- asymptotic theory for non-Gaussian processes satisfying a certain mixing condition (Serroukh, Walden and Percival, 2000)
- biased estimators of wavelet variance (Aldrich, 2005)
- unbiased estimator of wavelet variance for gappy time series (Mondal and Percival, 2010a)
- robust estimation (Mondal and Percival, 2010b)
- wavelet variance for random fields (Mondal and Percival, 2010c)
- wavelet-based characteristic scales (Keim and Percival, 2010)

Summary
- the wavelet variance gives a scale-based analysis of variance
- presented statistical theory for Gaussian processes with stationary increments
- in addition to the applications we have considered, the wavelet variance has been used to analyze
  - genome sequences
  - changes in variance of soil properties
  - canopy gaps in forests
  - accumulation of snow fields in polar regions
  - boundary layer atmospheric turbulence
  - regular and semiregular variable stars

Wavelet-Based Signal Extraction: Overview
- outline key ideas behind the wavelet-based approach
- description of four basic models for signal estimation
- discussion of why wavelets can help estimate certain signals
- simple thresholding & shrinkage schemes for signal estimation
- wavelet-based thresholding and shrinkage
- discuss some extensions to the basic approach

Wavelet-Based Signal Estimation: I
- DWT analysis of X yields W = 𝒲X
- DWT synthesis X = 𝒲ᵀW yields a multiresolution analysis by splitting 𝒲ᵀW into pieces associated with different scales
- DWT synthesis can also estimate a signal hidden in X if we can modify W to get rid of the noise in the wavelet domain
- if W′ is a noise-reduced version of W, can form a signal estimate via 𝒲ᵀW′
WMTSA: 393

Wavelet-Based Signal Estimation: II
- key ideas behind simple wavelet-based signal estimation:
  - certain signals can be efficiently described by the DWT using all of the scaling coefficients and a small number of large wavelet coefficients
  - noise is manifested in a large number of small wavelet coefficients
  - can either threshold or shrink the wavelet coefficients to eliminate noise in the wavelet domain
- these key ideas led to wavelet thresholding and shrinkage, as proposed by Donoho, Johnstone and coworkers in the 1990s
WMTSA: 393–394

Models for Signal Estimation: I
- will consider two types of signals:
  1. D, an N dimensional deterministic signal
  2. C, an N dimensional stochastic signal; i.e., a vector of random variables (RVs) with covariance matrix Σ_C
- will consider two types of noise:
  1. ε, an N dimensional vector of independent and identically distributed (IID) RVs with mean 0 and covariance matrix Σ_ε = σ²_ε I_N
  2. η, an N dimensional vector of non-IID RVs with mean 0 and covariance matrix Σ_η
  - one form of non-IID: RVs independent, but with different variances
  - another form of non-IID: RVs are correlated
WMTSA: 393–394

Models for Signal Estimation: II
- leads to four basic 'signal + noise' models for X:
  1. X = D + ε
  2. X = D + η
  3. X = C + ε
  4. X = C + η
- in the latter two cases, the stochastic signal C is assumed to be independent of the associated noise
WMTSA: 393–394

Signal Representation via Wavelets: I
- consider X = D + ε first
- the signal estimation problem is simplified if we can assume that the important part of D is in its large values
- this assumption is not usually viable in the original (i.e., time domain) representation of D, but might be true in another domain
- an orthonormal transform 𝒪 might be useful because d = 𝒪D is equivalent to D (since D = 𝒪ᵀd)
- we might be able to find 𝒪 such that the signal is isolated in M ≪ N large transform coefficients
- Q: how can we judge whether a particular 𝒪 might be useful for representing D?
WMTSA: 394

Signal Representation via Wavelets: II
- let d_j be the jth transform coefficient in d = 𝒪D
- let |d_(0)|, |d_(1)|, ..., |d_(N−1)| be the |d_j|'s reordered by magnitude: |d_(0)| ≥ |d_(1)| ≥ ··· ≥ |d_(N−1)|
- example: if d = [−3, 1, −4, 7, 2, 1]ᵀ, then |d_(0)| = |d_3| = 7, |d_(1)| = |d_2| = 4, |d_(2)| = |d_0| = 3, etc.
- define a normalized partial energy sequence (NPES):
  C_M ≡ Σ_{j=0}^{M−1} |d_(j)|² / Σ_{j=0}^{N−1} |d_(j)|² = (energy in the largest M terms) / (total energy in the signal)
- let I_M be the N × N diagonal matrix whose jth diagonal term is 1 if |d_j| is one of the M largest magnitudes and is 0 otherwise
WMTSA: 394–395
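The NPES is a one-liner to compute. A small sketch (stdlib Python; function name mine) checked against the slide's example d = [−3, 1, −4, 7, 2, 1]ᵀ, whose ordered squared magnitudes are 49, 16, 9, 4, 1, 1 with total energy 80:

```python
def npes(d):
    """Normalized partial energy sequence: C_M for M = 1, ..., N."""
    energies = sorted((dj * dj for dj in d), reverse=True)
    total = sum(energies)
    c, running = [], 0.0
    for e in energies:
        running += e
        c.append(running / total)
    return c

c = npes([-3, 1, -4, 7, 2, 1])
assert abs(c[0] - 49 / 80) < 1e-12      # largest term alone: C_1 = 0.6125
assert abs(c[2] - 74 / 80) < 1e-12      # three largest terms: C_3 = 0.925
assert abs(c[-1] - 1.0) < 1e-12         # all terms: C_N = 1
```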

Signal Representation via Wavelets: III
- form D̂_M ≡ 𝒪ᵀ I_M d, which is an approximation to D
- when d = [−3, 1, −4, 7, 2, 1]ᵀ and M = 3, we have
  I_3 = diag(1, 0, 1, 1, 0, 0) and thus D̂_M = 𝒪ᵀ [−3, 0, −4, 7, 0, 0]ᵀ
- one interpretation for the NPES:
  1 − C_M = ‖D − D̂_M‖² / ‖D‖² = relative approximation error
WMTSA: 394–395

Signal Representation via Wavelets: IV
[figure: three signals D₁, D₂ and D₃ plotted over t = 0, ..., 127]
- consider the three signals plotted above
- D₁ is a sinusoid, which can be represented succinctly by the discrete Fourier transform (DFT)
- D₂ is a 'bump' (only a few nonzero values in the time domain)
- D₃ is a linear combination of D₁ and D₂
WMTSA: 395–396

Signal Representation via Wavelets: V
- consider three different orthonormal transforms:
  - the identity transform I (time)
  - the orthonormal DFT F (frequency), where F has (k, t)th element exp(−i2πtk/N)/√N for 0 ≤ k, t ≤ N − 1
  - the LA(8) DWT 𝒲 (wavelet)
- number of terms M needed to achieve relative error < 1%:

                 D₁   D₂   D₃
  DFT             2   29   28
  identity        5    9   75
  LA(8) wavelet  22    4    2

WMTSA: 395–396

Signal Representation via Wavelets: VI
[figure: NPESs C_M versus M for D₁, D₂ and D₃]
- use NPESs to see how well these three signals are represented in the time, frequency (DFT) and wavelet (LA(8)) domains
- time (solid curves), frequency (dotted) and wavelet (dashed)
WMTSA: 395–396

Signal Representation via Wavelets: IX
[figure: DFT-based D̂_M (left-hand column) & J₀ = 6 LA(8) DWT-based D̂_M (right) for an NMR series X (data courtesy of A. Maudsley, UCSF), for several choices of M]
WMTSA: 430–432

Signal Estimation via Thresholding: I
- thresholding schemes involve
  1. computing O ≡ 𝒪X
  2. defining O^(t) as the vector with lth element
     O^(t)_l = 0 if |O_l| ≤ δ, and some nonzero value otherwise, where the nonzero values are yet to be defined
  3. estimating D via D̂^(t) ≡ 𝒪ᵀO^(t)
- the simplest scheme is hard thresholding (a 'kill/keep' strategy):
  O^(ht)_l = 0 if |O_l| ≤ δ; O_l otherwise
WMTSA: 399

Hard Thresholding Function
[figure: mapping from O_l to O^(ht)_l, equal to 0 on [−δ, δ] and to the identity outside that interval]
WMTSA: 399

Signal Estimation via Thresholding: II
- hard thresholding is the strategy that arises from the solution to a simple optimization problem, namely, find D̂_m such that
  γ_m ≡ ‖X − D̂_m‖² + mδ²
  is minimized over all possible D̂_m ≡ 𝒪ᵀ I_m O, m = 0, ..., N
- δ is a fixed parameter that is set a priori (we assume δ > 0)
- ‖X − D̂_m‖² is a measure of fidelity
  - rationale for this term: D̂_m shouldn't stray too far from X (particularly if the signal-to-noise ratio is high)
  - fidelity increases (the measure decreases) as m increases
  - in minimizing γ_m, consideration of this term alone suggests that m should be large
WMTSA: 398

Signal Estimation via Thresholding: III
mδ² is a penalty for too many terms
rationale: heuristic says d = 𝒪D consists of just a few large coefficients
penalty increases as m increases
in minimizing γ_m, consideration of this term alone suggests that m should be small
optimization problem: balance off fidelity & parsimony
can show that γ_m = ‖X − D̂_m‖² + mδ² is minimized when m is set such that I_m picks out all coefficients satisfying O_l² > δ²
WMTSA: 398 II–62
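a small brute-force check of this claim, under the assumption that for each m the best I_m keeps the m largest-magnitude coefficients (orthonormality makes ‖X − D̂_m‖² equal the energy of the discarded coefficients); the coefficient values here are arbitrary illustration:

```python
import numpy as np

# coefficients of X in some orthonormal basis; by orthonormality,
# ||X - D_hat_m||^2 equals the energy of the *discarded* coefficients
O = np.array([5.0, 0.3, -2.0, 0.1, 1.5, -0.4])
delta = 1.0

# gamma_m for the best m-term fit: keep the m largest-magnitude coefficients
mags = np.sort(np.abs(O))[::-1]
gammas = [np.sum(mags[m:] ** 2) + m * delta**2 for m in range(len(O) + 1)]
m_star = int(np.argmin(gammas))  # minimizing number of retained terms
```

the minimizing m keeps exactly the coefficients with O_l² > δ² (three of them here).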

Signal Estimation via Thresholding: IV
alternative scheme is soft thresholding:
O^(st)_l = sign{O_l} (|O_l| − δ)_+ ,
where
sign{O_l} ≡ +1 if O_l > 0; 0 if O_l = 0; −1 if O_l < 0
and
(x)_+ ≡ x if x ≥ 0; 0 if x < 0
one rationale for soft thresholding: fits into Stein's class of estimators, for which unbiased estimation of risk is possible
WMTSA: 399 4 II–63
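a numpy sketch of the soft rule, transcribing sign{O_l}(|O_l| − δ)_+ directly:

```python
import numpy as np

def soft_threshold(O, delta):
    """Soft rule: shrink every coefficient toward zero by delta, zeroing small ones."""
    O = np.asarray(O, dtype=float)
    return np.sign(O) * np.maximum(np.abs(O) - delta, 0.0)
```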

Soft Thresholding Function
[plot of the mapping from O_l to O^(st)_l over −3δ ≤ O_l ≤ 3δ]
WMTSA: 399 4 II–64

Signal Estimation via Thresholding: V
third scheme is mid thresholding:
O^(mt)_l = sign{O_l} (|O_l| − δ)_{++} ,
where
(|O_l| − δ)_{++} ≡ 2(|O_l| − δ)_+ if |O_l| < 2δ; |O_l| otherwise
provides a compromise between hard and soft thresholding
WMTSA: 399 4 II–65
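a numpy sketch of the mid rule; note that it agrees with the hard rule for |O_l| ≥ 2δ and ramps linearly (twice the soft rule) below that, so the mapping is continuous at 2δ:

```python
import numpy as np

def mid_threshold(O, delta):
    """Mid rule: kill below delta, ramp on [delta, 2*delta), keep above 2*delta."""
    O = np.asarray(O, dtype=float)
    ramp = 2.0 * np.maximum(np.abs(O) - delta, 0.0)      # 2*(|O| - delta)_+
    mag = np.where(np.abs(O) < 2.0 * delta, ramp, np.abs(O))
    return np.sign(O) * mag
```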

Mid Thresholding Function
[plot of the mapping from O_l to O^(mt)_l over −3δ ≤ O_l ≤ 3δ]
WMTSA: 399 4 II–66

Signal Estimation via Thresholding: VI
example of mid thresholding with δ = 3
[plots of X_t, the coefficients O_l, the thresholded coefficients O^(mt)_l and the estimate D̂^(mt)_t versus t or l]
II–67

Universal Threshold
Q: how do we go about setting δ?
specialize to IID Gaussian noise with covariance σ² I_N
can argue that e ≡ 𝒪ε is also IID Gaussian with covariance σ² I_N
Donoho & Johnstone (1995) proposed δ^(u) ≡ [2σ² log(N)]^{1/2} ('log' here is log base e)
rationale for δ^(u): because of Gaussianity, can argue that
P[ max_l |e_l| > δ^(u) ] ≤ 1/[4π log(N)]^{1/2} → 0 as N → ∞,
so no noise coefficient e_l will exceed the threshold in the limit
WMTSA: 4 42 II–68
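a quick simulation of this rationale (a sketch with 200 replicates of pure N(0, 1) noise; the replicate count and seed are arbitrary choices here): the observed exceedance rate should be small, with the bound above working out to roughly 0.10 for N = 2048:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 2048, 1.0
delta_u = np.sqrt(2 * sigma**2 * np.log(N))  # universal threshold, log base e

# fraction of pure-noise vectors whose largest |e_l| exceeds delta_u
exceed = [np.abs(rng.standard_normal(N)).max() > delta_u for _ in range(200)]
rate = float(np.mean(exceed))
```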

Wavelet-Based Thresholding
assume model of deterministic signal plus IID Gaussian noise with mean 0 and variance σ²: X = D + ε
using a DWT matrix 𝒲, form W = 𝒲X = 𝒲D + 𝒲ε ≡ d + e
because ε is IID Gaussian, so is e
Donoho & Johnstone (1994) advocate the following:
form partial DWT of level J0: W_1, ..., W_{J0} and V_{J0}
threshold the W_j's but leave V_{J0} alone (i.e., administratively, all N/2^{J0} scaling coefficients are assumed to be part of d)
use the universal threshold δ^(u) = [2σ² log(N)]^{1/2}
use a thresholding rule to form W_j^(t) (hard, etc.)
estimate D by inverse transforming W_1^(t), ..., W_{J0}^(t) and V_{J0}
WMTSA: 47 49 II–69
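the whole recipe can be sketched end to end; for brevity this uses hand-rolled Haar filters rather than the D(4)/LA(8) of the later examples, hard thresholding, and the MAD estimate of σ from the next slide (all function names are ours):

```python
import numpy as np

def haar_dwt(x, J0):
    """Partial Haar DWT of level J0: returns [W_1, ..., W_J0] and V_J0."""
    v, Ws = np.asarray(x, dtype=float), []
    for _ in range(J0):
        Ws.append((v[1::2] - v[::2]) / np.sqrt(2.0))  # wavelet coefficients
        v = (v[1::2] + v[::2]) / np.sqrt(2.0)         # scaling coefficients
    return Ws, v

def haar_idwt(Ws, v):
    """Inverse of haar_dwt (the synthesis step)."""
    for w in reversed(Ws):
        out = np.empty(2 * len(v))
        out[::2], out[1::2] = (v - w) / np.sqrt(2.0), (v + w) / np.sqrt(2.0)
        v = out
    return v

def denoise(x, J0):
    """Hard-threshold W_1, ..., W_J0 at the universal level; leave V_J0 alone."""
    Ws, v = haar_dwt(x, J0)
    sigma = np.median(np.abs(Ws[0])) / 0.6745       # MAD estimate from W_1
    delta = np.sqrt(2 * sigma**2 * np.log(len(x)))  # universal threshold
    Ws = [np.where(np.abs(w) <= delta, 0.0, w) for w in Ws]
    return haar_idwt(Ws, v)
```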

MAD Scale Estimator: I
procedure assumes σ is known, which is not usually the case
if unknown, use the median absolute deviation (MAD) scale estimator to estimate σ using W_1:
σ̂_(mad) ≡ median{ |W_{1,0}|, |W_{1,1}|, ..., |W_{1,N/2−1}| } / 0.6745
heuristic: the bulk of the W_{1,t}'s should be due to noise
0.6745 yields an estimator such that E{σ̂_(mad)} = σ when the W_{1,t}'s are IID Gaussian with mean 0 and variance σ²
designed to be robust against large W_{1,t}'s due to signal
WMTSA: 42 II–70

MAD Scale Estimator: II
example: suppose W_1 has 7 small noise coefficients & 2 large signal coefficients (say, a & b, with 2 ≪ |a| < |b|):
W_1 = [1.23, 1.72, 0.8, 0.1, a, 0.3, 0.67, b, 1.33]^T
ordering these by their magnitudes yields
0.1, 0.3, 0.67, 0.8, 1.23, 1.33, 1.72, |a|, |b|
median of these absolute deviations is 1.23, so σ̂_(mad) = 1.23/0.6745 ≈ 1.82
σ̂_(mad) is not influenced adversely by a and b; i.e., the scale estimate depends largely on the many small coefficients due to noise
WMTSA: 42 II–71
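the worked example in code, with the two large coefficients a and b set to arbitrary large values to show the robustness:

```python
import numpy as np

def mad_sigma(W1):
    """MAD scale estimator from the unit-scale wavelet coefficients W_1."""
    return np.median(np.abs(W1)) / 0.6745

# 7 small noise coefficients plus 2 large signal coefficients a, b
a, b = 10.0, 20.0
W1 = np.array([1.23, 1.72, 0.8, 0.1, a, 0.3, 0.67, b, 1.33])
sigma_hat = mad_sigma(W1)  # median magnitude is 1.23, so 1.23/0.6745
```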

Examples of DWT-Based Thresholding: I
[plot of the NMR spectrum X versus t, 0 ≤ t < 1024]
NMR spectrum
WMTSA: 48 II–72

Examples of DWT-Based Thresholding: II
[plot of D̂^(ht) versus t]
signal estimate using a J0 = 6 partial D(4) DWT with hard thresholding and universal threshold level estimated by δ̂^(u) = [2σ̂²_(mad) log(N)]^{1/2} ≈ 6.49
WMTSA: 48 II–73

Examples of DWT-Based Thresholding: III
[plot of D̂^(ht) versus t]
same as before, but now using the LA(8) DWT with δ̂^(u) ≈ 6.3
WMTSA: 48 II–74

Examples of DWT-Based Thresholding: IV
[plot of D̂^(st) versus t]
signal estimate using a J0 = 6 partial LA(8) DWT, but now with soft thresholding
WMTSA: 48 II–75

Examples of DWT-Based Thresholding: V
[plot of D̂^(mt) versus t]
signal estimate using a J0 = 6 partial LA(8) DWT, but now with mid thresholding
WMTSA: 48 II–76

MODWT-Based Thresholding
can base the thresholding procedure on the MODWT rather than the DWT, yielding signal estimators D̃^(ht), D̃^(st) and D̃^(mt)
because the MODWT filters are normalized differently, the universal threshold must be adjusted for each level:
δ_j^(u) ≡ [2σ̃²_(mad) log(N)/2^j]^{1/2},
where now the MAD scale estimator is based on the unit scale MODWT wavelet coefficients
results are identical to what 'cycle spinning' would yield
WMTSA: 429 43 II–77

Examples of MODWT-Based Thresholding: I
[plot of D̃^(ht) versus t]
signal estimate using a J0 = 6 LA(8) MODWT with hard thresholding
WMTSA: 429 43 II–78

Examples of MODWT-Based Thresholding: II
[plot of D̃^(st) versus t]
same as before, but now with soft thresholding
WMTSA: 429 43 II–79

Examples of MODWT-Based Thresholding: III
[plot of D̃^(mt) versus t]
same as before, but now with mid thresholding
WMTSA: 429 43 II–80

Signal Estimation via Shrinkage: I
so far, we have only considered signal estimation via thresholding rules, which map some O_l to zero
will now consider shrinkage rules, which differ from thresholding only in that nonzero coefficients are mapped to nonzero values rather than exactly zero (but the values can be very close to zero!)
there are several ways in which shrinkage rules arise
will consider a conditional mean approach (identical to a Bayesian approach)
II–81

Background on Conditional PDFs: I
let X and Y be RVs with marginal probability density functions (PDFs) f_X(·) and f_Y(·)
let f_{X,Y}(x, y) be their joint PDF at the point (x, y)
the conditional PDF of Y given X = x is defined as
f_{Y|X=x}(y) = f_{X,Y}(x, y) / f_X(x)
f_{Y|X=x}(·) is a PDF, so its mean value is
E{Y | X = x} = ∫ y f_{Y|X=x}(y) dy;
the above is called the conditional mean of Y, given X
WMTSA: 258 26 II–82

Background on Conditional PDFs: II
suppose the RVs X and Y are related, but we can only observe X
want to approximate the unobservable Y based on a function of X
example: X represents a stochastic signal Y buried in noise
suppose we want our approximation to be the function of X, say U(X), such that the mean square difference between Y and U(X) is as small as possible; i.e., we want
E{(Y − U(X))²}
to be as small as possible
solution is to use U(X) = E{Y | X}; i.e., the conditional mean of Y given X is our best guess at Y in the sense of minimizing the mean square error
(related to the fact that E{(Y − a)²} is smallest when a = E{Y})
WMTSA: 26 II–83
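a tiny discrete illustration of this optimality (the joint PMF below is an arbitrary example of ours): the conditional mean attains a mean square error no other predictor of Y from X can beat:

```python
import numpy as np

# joint PMF for (X, Y): rows index X in {0, 1}, columns index Y in {-1, 0, 2}
ys = np.array([-1.0, 0.0, 2.0])
p = np.array([[0.2, 0.1, 0.1],
              [0.1, 0.2, 0.3]])

# E{Y | X = x} for each of the two x values
cond_mean = (p * ys).sum(axis=1) / p.sum(axis=1)

def mse(g):
    """Mean square error of a predictor given as its two values g(0), g(1)."""
    return sum(p[i, j] * (ys[j] - g[i]) ** 2
               for i in range(2) for j in range(3))

best = mse(cond_mean)
# perturbing the conditional mean in any coordinate can only do worse
others = [mse(cond_mean + d)
          for d in (np.array([0.5, 0.0]), np.array([0.0, -0.3]))]
```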

Conditional Mean Approach: I
assume model of stochastic signal plus non-IID noise: X = C + η so that
O ≡ 𝒪X = 𝒪C + 𝒪η ≡ R + n
component-wise, have O_l = R_l + n_l
because C and η are independent, R and n must be also
suppose we approximate R_l via R̂_l ≡ U(O_l), where U(O_l) is selected to minimize E{(R_l − U(O_l))²}
solution is to set U(O_l) equal to E{R_l | O_l}, so let's work out what form this conditional mean takes
to get E{R_l | O_l}, need the PDF of R_l given O_l, which is
f_{R_l|O_l=o_l}(r_l) = f_{R_l,O_l}(r_l, o_l) / f_{O_l}(o_l) = f_{R_l}(r_l) f_{n_l}(o_l − r_l) / ∫ f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l
WMTSA: 48 49 II–84

Conditional Mean Approach: II
mean value of f_{R_l|O_l=o_l}(·) yields the estimator R̂_l = E{R_l | O_l}:
E{R_l | O_l = o_l} = ∫ r_l f_{R_l|O_l=o_l}(r_l) dr_l = ∫ r_l f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l / ∫ f_{R_l}(r_l) f_{n_l}(o_l − r_l) dr_l
to make further progress, we need a model for the wavelet-domain representation R_l of the signal
heuristic that signal in the wavelet domain has a few large values and lots of small values suggests a Gaussian mixture model
WMTSA: 4 II–85

Conditional Mean Approach: III
let I_l be an RV such that P[I_l = 1] = p_l & P[I_l = 0] = 1 − p_l
under the Gaussian mixture model, R_l has the same distribution as
I_l N(0, γ_l² σ²_{G_l}) + (1 − I_l) N(0, σ²_{G_l}),
where N(0, σ²) is a Gaussian RV with mean 0 and variance σ²
2nd component models the small # of large signal coefficients
1st component models the large # of small coefficients (γ_l² ≪ 1)
[plot of example PDFs for the two components for a case with p_l = 0.75]
WMTSA: 4 II–86

Conditional Mean Approach: VI
let's simplify to a sparse signal model by setting γ_l = 0; i.e., the large # of small coefficients are all zero
distribution for R_l is then the same as that of (1 − I_l) N(0, σ²_{G_l})
to complete the model, let n_l obey a Gaussian distribution with mean 0 and variance σ²_{n_l}
the conditional mean estimator becomes
E{R_l | O_l = o_l} = b_l o_l / (1 + c_l),
where b_l ≡ σ²_{G_l}/(σ²_{G_l} + σ²_{n_l}) and
c_l ≡ p_l (σ²_{G_l} + σ²_{n_l})^{1/2} e^{−o_l² b_l/(2σ²_{n_l})} / [(1 − p_l) σ_{n_l}]
WMTSA: 4 II–87
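a direct transcription of this estimator, with b_l and c_l as above (the function name is ours); it shrinks small |o_l| essentially to zero while leaving large |o_l| nearly untouched, apart from the Wiener-style factor b_l:

```python
import numpy as np

def cond_mean_shrink(o, p, sigma_n2, sigma_G2):
    """E{R_l | O_l = o} under the sparse Gaussian signal-plus-noise model."""
    b = sigma_G2 / (sigma_G2 + sigma_n2)
    c = (p * np.sqrt(sigma_G2 + sigma_n2) / ((1 - p) * np.sqrt(sigma_n2))
         * np.exp(-o**2 * b / (2 * sigma_n2)))
    return b * o / (1 + c)
```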

Conditional Mean Approach: VII
[plot of the conditional mean shrinkage rule E{R_l | O_l = o_l} versus o_l]
conditional mean shrinkage rule for p_l = 0.95 (i.e., 95% of signal coefficients are 0); σ²_{n_l} = 1; and σ²_{G_l} = 5 (curve furthest from the dotted diagonal) and 25 (curve nearest to the diagonal)
as σ²_{G_l} gets large (i.e., large signal coefficients increase in size), the shrinkage rule starts to resemble the mid thresholding rule
WMTSA: 4 42 II–88

Wavelet-Based Shrinkage: I
assume model of stochastic signal plus Gaussian IID noise: X = C + ε so that
W = 𝒲X = 𝒲C + 𝒲ε ≡ R + e
component-wise, have W_{j,t} = R_{j,t} + e_{j,t}
form partial DWT of level J0; shrink the W_j's, but leave V_{J0} alone
assume E{R_{j,t}} = 0 (reasonable for W_j, but not for V_{J0})
use a conditional mean approach with the sparse signal model
R_{j,t} has distribution dictated by (1 − I_{j,t}) N(0, σ²_G), where P[I_{j,t} = 1] = p and P[I_{j,t} = 0] = 1 − p
R_{j,t}'s are assumed to be IID
model for e_{j,t} is Gaussian with mean 0 and variance σ²_ε
note: parameters do not vary with j or t
WMTSA: 424 II–89

Wavelet-Based Shrinkage: II
model has three parameters σ²_G, p and σ²_ε, which need to be set
let σ²_R and σ²_W be the variances of the RVs R_{j,t} and W_{j,t}
have relationships σ²_R = (1 − p)σ²_G and σ²_W = σ²_R + σ²_ε
set σ̂²_ε = σ̂²_(mad) using W_1
let σ̂²_W be the sample mean of all W²_{j,t}
given p, let σ̂²_G = (σ̂²_W − σ̂²_ε)/(1 − p)
p usually chosen subjectively, keeping in mind that p is the proportion of noise-dominated coefficients (can set it based on a rough estimate of the proportion of small coefficients)
WMTSA: 424 426 II–90

Examples of Wavelet-Based Shrinkage: I
[plot of the shrinkage signal estimate versus t, p = 0.9]
shrinkage signal estimate of the NMR spectrum based upon a level J0 = 6 partial LA(8) DWT and the conditional mean with p = 0.9
WMTSA: 425 II–91

Examples of Wavelet-Based Shrinkage: II
[plot of the shrinkage signal estimate versus t, p = 0.95]
same as before, but now with p = 0.95
WMTSA: 425 II–92

Examples of Wavelet-Based Shrinkage: III
[plot of the shrinkage signal estimate versus t, p = 0.99]
same as before, but now with p = 0.99 (as p → 1, we declare there are proportionately fewer significant signal coefficients, implying the need for heavier shrinkage)
WMTSA: 425 II–93

Comments on Next Generation Denoising: I
[plot of an ECG time series X along with its circularly shifted DWT coefficient vectors W_1, ..., W_6 and V_6]
classical denoising looks at each W_{j,t} alone; for real-world signals, coefficients often cluster within a given level and persist across adjacent levels (the ECG series offers an example)
WMTSA: 45 II–94

Comments on Next Generation Denoising: II
here are some next generation approaches that exploit these real-world properties:
Crouse et al. (1998) use hidden Markov models for stochastic signal DWT coefficients to handle clustering, persistence and non-Gaussianity
Huang and Cressie (2000) consider scale-dependent multiscale graphical models to handle clustering and persistence
Cai and Silverman (2001) consider block thresholding, in which coefficients are thresholded in blocks rather than individually (handles clustering)
Dragotti and Vetterli (2003) introduce the notion of wavelet footprints to track discontinuities in a signal across different scales (handles persistence)
WMTSA: 45 452 II–95

Comments on Next Generation Denoising: III
classical denoising also suffers from the problem of overall significance of multiple hypothesis tests
next generation work integrates the idea of the false discovery rate (Benjamini and Hochberg, 1995) into denoising (see Wink and Roerdink, 2004, for an applications-oriented discussion)
for more recent developments (there are a lot!!!), see
the review article by Antoniadis (2007)
Chapters 3 and 4 of the book by Nason (2008)
the October 2009 issue of Statistica Sinica, which has a special section entitled 'Multiscale Methods and Statistics: A Productive Marriage'
II–96

Wavelet-Based Decorrelation of Time Series: Overview
the DWT is well-suited for decorrelating certain time series, including ones generated from a fractionally differenced (FD) process
on the synthesis side, this leads to
DWT-based simulation of FD processes
wavelet-based bootstrapping
on the analysis side, this leads to
wavelet-based estimators for FD parameters
a test for homogeneity of variance
a test for trends (won't discuss; see Craigmile et al., 2004, for details)
II–97

DWT of an FD Process: I
[realization of an FD(0.4) time series X, 0 ≤ t < 1024, and its sample ACS out to lag 32]
realization of an FD(0.4) time series X along with its sample autocorrelation sequence (ACS): for τ ≥ 0,
ρ̂_{X,τ} ≡ Σ_{t=0}^{N−1−τ} X_t X_{t+τ} / Σ_{t=0}^{N−1} X_t²
note that the ACS dies down slowly
WMTSA: 34 342 II–98

DWT of an FD Process: II V 7 W 7 W 6 W 5 W 4 W 3 W 2 W 4 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 32 τ (lag) LA(8) DWT of FD(.4) series and sample ACSs for each W j & V 7, along with 95% confidence intervals for white noise WMTSA: 34 342 II 99

MODWT of an FD Process
[MODWT coefficients W̃_1, ..., W̃_7 and Ṽ_7 and sample ACSs for each, out to lag 32]
LA(8) MODWT of the FD(0.4) series & sample ACSs for the MODWT coefficients, none of which are approximately uncorrelated
II–100

DWT of an FD Process: III
in contrast to X, the ACSs for the W_j's are consistent with white noise
variance of the RVs in W_j increases with j: for an FD process,
var{W_{j,t}} ≈ c τ_j^{2δ} ≡ C_j,
where c is a constant depending on δ but not j, and τ_j ≡ 2^{j−1} is the scale associated with W_j
for white noise (δ = 0), var{W_{j,t}} is the same for all j
dependence in X thus manifests itself in the wavelet domain by different variances for the wavelet coefficients at different scales
WMTSA: 343 344 II–101

Correlations Within a Scale and Between Two Scales
let {s_{X,τ}} denote the autocovariance sequence (ACVS) for {X_t}; i.e., s_{X,τ} = cov{X_t, X_{t+τ}}
let {h_{j,l}} denote the equivalent wavelet filter for the jth level
to quantify the decorrelation, can write
cov{W_{j,t}, W_{j′,t′}} = Σ_{l=0}^{L_j−1} Σ_{l′=0}^{L_{j′}−1} h_{j,l} h_{j′,l′} s_{X, 2^j(t+1)−l−2^{j′}(t′+1)+l′},
from which we can get the ACVS (and hence the within-scale correlations) for {W_{j,t}}:
cov{W_{j,t}, W_{j,t+τ}} = Σ_{m=−(L_j−1)}^{L_j−1} s_{X, 2^j τ+m} Σ_{l=0}^{L_j−1−|m|} h_{j,l} h_{j,l+|m|}
WMTSA: 345 II–102
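the within-scale formula is easy to transcribe and sanity-check: for white noise (s_{X,0} = σ², zero at all other lags) any orthonormal wavelet filter must yield uncorrelated coefficients with variance σ². A sketch using the unit-level Haar filter (function names ours):

```python
import numpy as np

def within_scale_acvs(h_j, s_X, tau, j):
    """cov{W_{j,t}, W_{j,t+tau}} from the level-j equivalent filter {h_{j,l}}
    and the ACVS of X, supplied as a function of the lag."""
    Lj = len(h_j)
    total = 0.0
    for m in range(-(Lj - 1), Lj):
        inner = sum(h_j[l] * h_j[l + abs(m)] for l in range(Lj - abs(m)))
        total += s_X(2**j * tau + m) * inner
    return total

# unit-scale Haar filter and a unit-variance white noise ACVS
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)
s_white = lambda lag: 1.0 if lag == 0 else 0.0
```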

Correlations Within a Scale
[correlations between W_{j,t} and W_{j,t+τ} for j = 1, 2, 3, 4 and the Haar, D(4) and LA(8) wavelets]
correlations between W_{j,t} and W_{j,t+τ} for an FD(0.4) process
correlations within a scale are slightly smaller for the Haar
maximum magnitude of correlation is less than 0.2
WMTSA: 345 346 II–103

Correlations Between Two Scales: I
[correlations between W_{j,t} and W_{j′,t′} for the level pairs with 1 ≤ j < j′ ≤ 4]
correlation between Haar wavelet coefficients W_{j,t} and W_{j′,t′} from an FD(0.4) process and for levels satisfying 1 ≤ j < j′ ≤ 4
WMTSA: 346 347 II–104

Correlations Between Two Scales: II
[same layout as before, but for LA(8) wavelet coefficients]
same as before, but now for LA(8) wavelet coefficients
correlations between scales decrease as the filter width L increases
WMTSA: 346 347 II–105

Wavelet Domain Description of FD Process
DWT acts as a decorrelating transform for FD processes and other (but not all!) intrinsically stationary processes
wavelet domain description is simple
wavelet coefficients within a given scale are approximately uncorrelated (refinement: assume a 1st order autoregressive model)
wavelet coefficients have a scale-dependent variance controlled by the two FD parameters (δ and σ²_ε)
wavelet coefficients between scales are also approximately uncorrelated (approximation improves as the filter width L increases)
WMTSA: 345 35 II–106

DWT-Based Simulation
properties of the DWT of FD processes lead to schemes for simulating a time series X ≡ [X_0, ..., X_{N−1}]^T with zero mean and with a multivariate Gaussian distribution
with N = 2^J, recall that X = 𝒲^T W, where W is the column vector formed by stacking the subvectors W_1, W_2, ..., W_j, ..., W_J and V_J
WMTSA: 355 II–107

Basic DWT-Based Simulation Scheme
assume W to contain N uncorrelated Gaussian (normal) random variables (RVs) with zero mean
assume W_j to have variance C_j = c τ_j^{2δ}
assume the single RV in V_J to have variance C_{J+1} (see Percival and Walden, 2000, for details on how to set C_{J+1})
approximate the FD time series X via Y ≡ 𝒲^T Λ^{1/2} Z, where
Λ^{1/2} is an N × N diagonal matrix with diagonal elements
C_1^{1/2}, ..., C_1^{1/2} (N/2 of these), C_2^{1/2}, ..., C_2^{1/2} (N/4 of these), ..., C_J^{1/2} (1 of these), C_{J+1}^{1/2} (1 of these)
Z is a vector of deviates drawn from a Gaussian distribution with zero mean and unit variance
WMTSA: 355 II–108
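the scheme Y = 𝒲^T Λ^{1/2} Z can be sketched with a full Haar DWT; the variance C_{J+1} used for the single scaling coefficient below is a crude stand-in (an assumption of ours; Percival and Walden give the proper choice), and all function names are ours:

```python
import numpy as np

def haar_idwt(Ws, v):
    """Inverse full Haar DWT (the synthesis operator W^T)."""
    for w in reversed(Ws):
        out = np.empty(2 * len(v))
        out[::2], out[1::2] = (v - w) / np.sqrt(2.0), (v + w) / np.sqrt(2.0)
        v = out
    return v

def simulate_fd(N, delta, c=1.0, rng=None):
    """Basic scheme: level-j coefficients i.i.d. N(0, C_j), C_j = c*tau_j^(2*delta)."""
    if rng is None:
        rng = np.random.default_rng()
    J = int(np.log2(N))
    Ws = [np.sqrt(c * (2.0 ** (j - 1)) ** (2 * delta)) * rng.standard_normal(N >> j)
          for j in range(1, J + 1)]
    vJ = np.sqrt(c * (2.0 ** J) ** (2 * delta)) * rng.standard_normal(1)  # crude C_{J+1}
    return haar_idwt(Ws, vJ)
```

for δ = 0 every C_j equals c, so the scheme reduces to white noise with variance c, which gives a quick consistency check.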

Refinements to Basic Scheme: I
covariance matrix for the approximation Y does not correspond to that of a stationary process
recall that 𝒲 treats X as if it were circular
let T be the N × N circular shift matrix; e.g., for N = 4,
T [Y_0, Y_1, Y_2, Y_3]^T = [Y_3, Y_0, Y_1, Y_2]^T and T² [Y_0, Y_1, Y_2, Y_3]^T = [Y_2, Y_3, Y_0, Y_1]^T
let κ be uniformly distributed over 0, 1, ..., N − 1
define Ỹ ≡ T^κ Y
Ỹ is stationary with ACVS given by, say, {s_{Ỹ,τ}}; etc.
WMTSA: 356 357 II–109

Refinements to Basic Scheme: II
Q: how well does {s_{Ỹ,τ}} match {s_{X,τ}}?
due to circularity, find that s_{Ỹ,N−τ} = s_{Ỹ,τ} for τ = 0, ..., N/2
implies s_{Ỹ,τ} cannot approximate s_{X,τ} well for τ close to N
can patch up by simulating Ỹ with M > N elements and then extracting the first N deviates (M = 4N works well)
WMTSA: 356 357 II–110

Refinements to Basic Scheme: III
[true ACVS (thick curves) and approximate ACVSs (thin curves) versus τ for M = N, M = 2N and M = 4N]
plot shows the true ACVS {s_{X,τ}} (thick curves) for an FD(0.4) process and the wavelet-based approximate ACVSs {s_{Ỹ,τ}} (thin curves) based on an LA(8) DWT in which an N = 64 series is extracted from M = N, M = 2N and M = 4N series
WMTSA: 356 357 II–111

Example and Some Notes
[plot of a simulated FD(0.4) series, 0 ≤ t < 1024]
simulated FD(0.4) series (LA(8), N = 1024 and M = 4N)
notes:
can form realizations faster than the best exact method
can efficiently simulate extremely long time series in real time (e.g., N = 2^30 = 1,073,741,824 or even longer!)
effect of the random circular shifting is to render the time series slightly non-Gaussian (a Gaussian mixture model)
WMTSA: 358 36 II–112

Wavelet-Domain Bootstrapping
for many (but not all!) time series, the DWT acts as a decorrelating transform:
to a good approximation, each W_j is a sample of a white noise process, and coefficients from different sub-vectors W_j and W_{j′} are also pairwise uncorrelated
variance of the coefficients in W_j depends on j
scaling coefficients V_{J0} are still autocorrelated, but there will be just a few of them if J0 is selected to be large
decorrelating property holds particularly well for FD and other processes with long-range dependence
the above suggests the following recipe for wavelet-domain bootstrapping of a statistic of interest, e.g., the sample autocorrelation sequence ρ̂_{X,1} at unit lag τ = 1
II–113

Recipe for Wavelet-Domain Bootstrapping
1. given X of length N = 2^J, compute the level J0 DWT (the choice J0 = J − 3 yields 8 coefficients in W_{J0} and V_{J0})
2. randomly sample with replacement from W_j to create the bootstrapped vector W_j^(b), j = 1, ..., J0
3. create V_{J0}^(b) using a 1st-order autoregressive parametric bootstrap
4. apply 𝒲^T to W_1^(b), ..., W_{J0}^(b) and V_{J0}^(b) to obtain the bootstrapped time series X^(b) and then form ρ̂^(b)_{X,1}
repeat the above many times to build up the sample distribution of the bootstrapped autocorrelations
II–114
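a sketch of the recipe with a Haar DWT; step 3's AR(1) parametric bootstrap of the scaling coefficients is simplified here to leaving V_{J0} untouched (an assumption made for brevity), and all function names are ours:

```python
import numpy as np

def haar_dwt(x, J0):
    """Partial Haar DWT of level J0: returns [W_1, ..., W_J0] and V_J0."""
    v, Ws = np.asarray(x, dtype=float), []
    for _ in range(J0):
        Ws.append((v[1::2] - v[::2]) / np.sqrt(2.0))
        v = (v[1::2] + v[::2]) / np.sqrt(2.0)
    return Ws, v

def haar_idwt(Ws, v):
    """Inverse partial Haar DWT (the synthesis step in recipe step 4)."""
    for w in reversed(Ws):
        out = np.empty(2 * len(v))
        out[::2], out[1::2] = (v - w) / np.sqrt(2.0), (v + w) / np.sqrt(2.0)
        v = out
    return v

def wavelet_bootstrap(x, J0, rng):
    """One replicate: resample each W_j with replacement, keep V_J0 as-is."""
    Ws, v = haar_dwt(x, J0)
    Wb = [rng.choice(w, size=len(w), replace=True) for w in Ws]
    return haar_idwt(Wb, v)

def rho1(x):
    """Sample unit-lag autocorrelation, as on the previous slide."""
    return np.sum(x[:-1] * x[1:]) / np.sum(x**2)
```

looping `wavelet_bootstrap` and collecting `rho1` of each replicate builds up the bootstrap distribution of the unit-lag autocorrelation.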

Illustration of Wavelet-Domain Bootstrapping
[Haar DWT coefficients V_4, W_4, W_3, W_2, W_1 and series X, 0 ≤ t < 128: original (left-hand column) and bootstrapped (right-hand)]
Haar DWT of an FD(0.45) series X (left-hand column) and a wavelet-domain bootstrap thereof (right-hand)
II–115
