Multiresolution local polynomial regression: A new approach to pointwise spatial adaptation


Digital Signal Processing 15 (2005)

Multiresolution local polynomial regression: A new approach to pointwise spatial adaptation

Vladimir Katkovnik a,b

a Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, Tampere, Finland
b Department of Mechatronics, Kwangju Institute of Science and Technology, Kwangju, South Korea

E-mail address: katkov@cs.tut.fi

Available online 5 October 2004

Abstract

In nonparametric local polynomial regression the adaptive selection of the scale parameter (window size/bandwidth) is a key problem. Recently new efficient algorithms, based on Lepski's approach, have been proposed in mathematical statistics for spatially adaptive varying scale denoising. A common feature of these algorithms is that they form test-estimates ŷ_h, different by the scale h ∈ H, and special statistical rules are exploited in order to select the estimate with the best pointwise varying scale. In this paper a novel multiresolution (MR) local polynomial regression is proposed. Instead of selection of the estimate with the best scale, a nonlinear estimate is built using all of the test-estimates ŷ_h. The adaptive estimation consists of two steps. The first step transforms the data into noisy spectrum coefficients (MR analysis). On the second step, this noisy spectrum is filtered by the thresholding procedure and used for estimation (MR synthesis).
© 2004 Published by Elsevier Inc.

Keywords: Adaptive scale; Kernel estimation; Local polynomial regression; Multiresolution analysis; Local nonparametric regression; Denoising; Sliding least square method

1. Introduction

The present work is devoted to studying the problem of adaptive estimation of a multivariable function given by noisy data. The developed multiresolution (MR) approach and algorithms are of a general nature and can be applied to a variety of univariate or

multidimensional problems. However, we prefer to give the presentation in terms of image processing as it allows a convenient and transparent motivation of the basic ideas as well as a good illustration of results. Thus, let the function to estimate be a two-dimensional (2D) image intensity given by noisy values on a 2D regular grid. The adaptive estimation used in this paper is based on direct nonparametric pointwise estimation of the image intensity without any preliminary edge recovering. A nonparametric regression estimator is derived from the local polynomial approximation (LPA) in a sliding window with a varying size (the estimator's scale) selected in a data-driven way.

Among other nonparametric approaches to regression estimation the LPA can be treated as probably one of the most theoretically justified and well studied. It is a powerful nonparametric technique which provides estimates in a pointwise manner based on a mean square polynomial fitting in a sliding window (e.g., [2,7,10,14–16,30]). In terms of image processing the LPA is a flexible tool to design 2D kernels (masks) having prescribed reproducing properties with respect to polynomial (smooth) components of the signal. The invariant and varying scale selection for the LPA has been studied thoroughly by many authors. Optimal, in particular, varying data-driven scale methods are of special interest for problems where the pointwise smooth approximation is natural and relevant. Image denoising provides good examples of this sort of problems. A crucial difference between the nonparametric LPA and the more traditional parametric methods, say the polynomial mean squared estimates, is that the latter are formed as unbiased ones, while the nonparametric estimates are biased, and the reasonable choice of the biasedness controlled by the scale parameter is of importance. In the nonparametric regression methods adaptive to unknown smoothness the adaptive selection of the scale is a key point. The problem of optimal scale selection admits an accurate mathematical formulation in terms of the nonparametric approach, where the optimal scale is defined by a compromise between the bias and the variance of estimation (e.g., [7,16,30]).

The idea of the used Lepski's adaptation method is as follows [27–29,38]. The algorithm searches for the largest local vicinity of the point of estimation where the LPA assumption fits well to the data. The test-estimates ŷ_h(x) are calculated for window sizes h from the set H, h ∈ H, and compared. The adaptive window size is defined as the largest of those windows whose estimate does not differ significantly from the estimates corresponding to the smaller window sizes. Special statistics are exploited in order to test these hypotheses and select the best scale giving the optimal balance between the random errors and the biasedness of the estimate. The Lepski's approach algorithms have introduced a number of statistical rules which are proved to be efficient in theory as well as in applications. The nonlinearity of the method is incorporated by an adaptive pointwise choice of the scale.

A novel spatial adaptivity introduced in this paper can be viewed as a development of two independent ideas: wavelet multiresolution analysis (e.g., [11,31]) and a pointwise adaptive scale selection based on the Lepski's approach. Instead of selection of the estimate with the best scale we build a nonlinear estimate using all of the available test-estimates ŷ_h(x), h ∈ H. The adaptive estimation is divided into two successive steps.
The first step transforms the data into noisy spectrum coefficients (MR analysis). In the second step, these coefficients are filtered by the thresholding

procedure and used for estimation (MR synthesis). The LPA based filters are exploited for the nonparametric (pointwise) MR spectrum analysis and synthesis. In this way we introduce an extension of the conventional scale adaptive nonparametric regression concept and yield a wider class of adaptive scale regression estimators with a potentially better performance.

The contribution of this paper is two-fold. Firstly, we present a summary of the nonparametric LPA methods including the basic ideas, algorithms for function and derivative estimation, and the accuracy analysis. This summary is concluded by a brief review of some recent methods for selection of the adaptive varying scales. Secondly, the new MR local polynomial regression is introduced as a valuable alternative to the conventional adaptive scale nonparametric regression.

The rest of the paper is organized as follows. In Section 2 the observation model as well as the estimation problem are discussed. The LPA method, motivation, basic algorithms, and the accuracy analysis are presented in Sections 3 and 4. The Lepski's adaptive varying scale algorithms are reviewed in Section 4.3. The nonparametric regression spectrum and MR analysis are introduced in Section 5. The multiresolution filtering based on the MR spectrum thresholding is presented in Section 6. The optimality of the adaptive scale local regression estimation is discussed briefly in Section 7. Implementation of the introduced algorithms as well as their complexity is a subject of Section 8. Similarity and difference of the MR local polynomial regression versus the wavelet MR techniques are discussed in Section 9. In Section 10 simulation results are presented. It is shown that appropriate filtering of the noisy MR spectrum allows one to achieve a better performance than can be achieved using the more traditional nonparametric approach based on selection of the best varying adaptive scale.

2. Observation model

The following model, commonly used for image denoising, is assumed:

z(x) = y(x) + σ·ε(x),  (1)

where the intensity y of the underlying image is defined as a function of two variables, y ∈ R, and ε is an additive noise. It is assumed that all functions in (1) are defined on a 2D rectangular regular grid x ∈ X ⊂ R² with pixels x = (x1, x2):

X = {(x1, x2): x1 = s1Δ, x2 = s2Δ},  s1 = 1,...,n1, s2 = 1,...,n2,  (2)

where s1, s2 are integers, Δ denotes the sampling interval, and n = n1·n2 is the total number of observations. The random noise ε is assumed to be standard Gaussian, i.i.d. for different x, with E{ε²} = 1.

The basic objective is to reconstruct (estimate) y(x) and derivatives of y(x) for any x ∈ X from the noisy observations {z(x), x ∈ X} with a pointwise mean squared error (MSE) risk which is as small as possible. It is assumed that y is unknown deterministic. For stochastic y it means that the main intention is to obtain the best result for every realization of y, even if the realizations are generated by a probabilistic phenomenon.
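As an illustration, the following sketch generates data according to the observation model (1) on the grid (2). This is a minimal sketch, assuming a sampling interval Δ = 1; the grid size, the region geometry, and the noise level are arbitrary choices for the example (the piecewise-constant intensity anticipates the model (5) discussed below), not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regular n1 x n2 grid (2) with sampling interval Delta = 1.
n1, n2 = 64, 64
x1, x2 = np.meshgrid(np.arange(n1), np.arange(n2), indexing="ij")

# A synthetic piecewise-constant intensity with two regions A_1, A_2
# separated by a diagonal edge (an arbitrary test image, 0 <= y <= 1).
y = np.where(x1 + x2 < n1, 0.25, 0.75)

# Noisy observations (1): z = y + sigma * eps, eps ~ N(0, 1) i.i.d.
sigma = 0.1
z = y + sigma * rng.standard_normal(y.shape)
```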

The discrete observations {y(x), x ∈ X} are obtained as samples of an underlying continuous argument function y(x), x ∈ R². Different hypotheses on this y can be applied for derivation and analysis of algorithms. Here we follow the nonparametric regression approach, assuming that a parametric representation of y as a function of x, say in the form of a series or a function with a reasonably small number of invariant parameters, does not exist or is unknown.

The following piecewise model of y is appropriate for the methods considered in this paper. Let the finite support of y be separated into Q regions A_q, q = 1,...,Q, each of them a connected set with an edge (boundary) G_q. The function y is assumed to be smooth (differentiable) within each region A_q:

y(x) = Σ_{q=1}^{Q} y_q(x)·1[x ∈ A_q],  (3)

where 1[x ∈ A_q] is an indicator of the region A_q, 1[x ∈ A_q] = 1 if x ∈ A_q and zero otherwise, and y_q is a continuous differentiable function belonging to the class

F_r(L_r) = {y: max_{r1+r2=|r|} |y^{(r)}(x)| = L_r(x) ≤ L_r, x ∈ R²}.  (4)

Here and in what follows r = (r1, r2) is a multi-index, r1, r2 are nonnegative integers and |r| = r1 + r2. The derivative corresponding to r is y^{(r)}(x) = ∂^{r1+r2} y(x)/∂x1^{r1}∂x2^{r2}, and L_r is a finite constant in (4).

The piecewise constant model of y,

y(x) = Σ_{q=1}^{Q} a_q·1[x ∈ A_q],  0 ≤ a_q ≤ 1,  (5)

is a particular case of (3) with constant values within each region A_q. In the models (3) and (5) y_q, a_q as well as the regions A_q are unknown. The boundaries G_q define the change-points of the piecewise smooth y.

The estimation of y can be produced in two different ways. One of the possible approaches deals with a two-stage procedure including estimation of the boundaries G_q on the first stage, which defines the regions A_q. The second stage is a parametric or nonparametric fitting of y_q on A_q. Another approach is connected with the concept of spatially adaptive estimation. In this context, the change-points or, more generally, cusps in the curves can be viewed as a sort of inhomogeneous behavior. One may therefore apply the same procedure, for instance nonlinear wavelet, ridgelet, or curvelet estimators, for all x, and the analysis focuses on the quality of estimation when the change-points are incorporated or not incorporated in the model. Under this approach, the main objective is to estimate the function and not the location of the change-points, which are treated as features of the function surface. In this paper we follow the second approach. The objective is to develop a method which simultaneously adapts to the varying smoothness of the estimated function and which is sensitive to discontinuities of the function and its derivatives.

3. Local polynomial approximation

3.1. Foundations

The idea of the LPA is simple and natural. It is assumed that the function y is well approximated by a polynomial in a neighborhood of the point of interest x. We find the coefficients of the polynomial fit by the weighted least square method and use this approximation in order to calculate the estimate for the point of interest x, called also the centre of the LPA. In fact, the local expansion is applied in order to calculate the estimate for this point of interest only. For the next point the calculations are repeated. This pointwise procedure determines the nonparametric character of the LPA estimation.

The linear LPA estimators have a very long prehistory (e.g., [2,7,14–16,30]). They are a very popular tool in statistics and signal processing with application to a wide variety of fields for smoothing, filtering, differentiation, interpolation and extrapolation. Note that the LPA has appeared in signal processing in a number of modifications and under different names: sliding (moving) least square, Savitzky–Golay filter, local regression, reproducing kernel method, moment filters, etc. We prefer the term LPA with a reference to publications on nonparametric estimation in mathematical statistics, where the advanced development of this technique can be seen. In this section we summarize the well known concepts of the discrete LPA.

Let x ∈ R² be a centre (reference point) of the LPA. Then, the estimate for v ∈ R² in the neighborhood of the centre x is presented as an expansion:

y(x, v) = C^T φ(x − v),
φ(x) = (φ1(x), φ2(x),..., φM(x))^T,  C = (C1, C2,...,CM)^T,  (6)

where φ(x) ∈ R^M is a vector of linearly independent 2D polynomials of the powers from 0 up to m, and C ∈ R^M is a vector of parameters of this model. In particular, the following polynomials can be exploited:

x1^{k1} x2^{k2} / (k1! k2!),  0 ≤ k1 + k2 ≤ m,  k1, k2 ≥ 0.  (7)

The total number of these 2D polynomials is equal to M = (m + 2)(m + 1)/2. For m ≤ 3 we obtain

φ1 = 1 for m = 0,
φ2 = x1, φ3 = x2 for m = 1,
φ4 = x1²/2, φ5 = x2²/2, φ6 = x1·x2 for m = 2,
φ7 = x1³/6, φ8 = x2³/6, φ9 = x1²·x2/2, φ10 = x1·x2²/2 for m = 3,  (8)

with M = 1, 3, 6, and 10 for m = 0, 1, 2, 3, respectively.

The term centre does not assume a central position of x in the neighborhood. It only emphasizes that the LPA is exploited in order to obtain the estimate for this particular value of the argument of y.

Table 1
Window name: w(x)
Rectangular symmetric: 1, |x| ≤ 1/2
Rectangular nonsymmetric: 1, 0 ≤ x ≤ 1
Exponential: (1/2) exp(−|x|)
Gaussian: (1/√(2π)) exp(−x²/2)
Epanechnikov: (3/4)(1 − x²), |x| ≤ 1
Bisquare window: (1 − x²)², |x| ≤ 1
Tricube: (1 − |x|³)³, |x| ≤ 1
Triweight window: (1 − x²)³, |x| ≤ 1

The conventional quadratic criterion function is applied in order to find the coefficients C in (6):

J_h(x) = Σ_{v∈X} w_h(x − v) (z(v) − y(x, v))²,  (9)

where {z(v), v ∈ X} are the discrete observations and the window

w_h(x) = w(x/h)/h²  (10)

is used to formulate mathematically the fitting localized in a neighborhood of x, while the scale parameter h > 0 determines the size of the neighborhood. The windowing weight w is usually assumed to satisfy the properties:

w(x) ≥ 0,  w(0) = max_x w(x),  ∫_{R²} w(x) dx = 1,  ∫_{R²} w²(x) dx < ∞.  (11)

The multiplicative window

w(x) = w1(x1)·w2(x2),  (12)

where w1 and w2 are functions of scalar (1D) arguments, is commonly applied. If the window is rectangular, all observations enter the criterion (9) with equal weights. Nonrectangular windows such as triangular, quadratic, Epanechnikov, and so on (see [7,16,30]) usually prescribe higher weights to observations which are closer to the centre x. Some typical 1D window functions used in local regression estimates are shown in Table 1. Let us also mention windows conventional in signal processing and associated with the names: Kaiser, Hamming, Bartlett, Blackman, Chebyshev, etc. Note that the B-splines also can be used as windows for the LPA. There is a simple way to generate nontrivial 2D windows different from the multiplicative ones (12). Let us replace the argument x1 in w1(x1) by the norm ‖x‖, where x is a vector and the norm is not exclusively Euclidean. Then, after the corresponding normalization, we obtain 2D window functions satisfying (11).

Let ŷ_h(x) be the LPA estimate of y(x), where the subindex h shows the dependence of the estimate on the scale parameter h. This LPA estimate of y(x) is defined according to (6) as ŷ_h(x) = y(x, v)|_{v=x} = y(x, x), i.e., the expansion (6) is used for calculation of the estimate for v = x only.
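A minimal sketch of these two ingredients, assuming the polynomial basis (7) and the Gaussian window of Table 1 combined per (12); the function names and the ordering of the monomials within each total degree are illustrative choices, not taken from the paper.

```python
import numpy as np
from math import factorial

def poly_basis(x1, x2, m):
    """Vector phi(x) of the 2D monomials x1^k1 * x2^k2 / (k1! k2!),
    0 <= k1 + k2 <= m, as in (7); the ordering inside each total degree
    is one convenient choice (it differs slightly from the listing (8))."""
    return np.array([x1**k1 * x2**(k - k1) / (factorial(k1) * factorial(k - k1))
                     for k in range(m + 1) for k1 in range(k, -1, -1)])

def gauss_window(x1, x2):
    """Separable 2D Gaussian window built per (12) from the 1D Gaussian of Table 1."""
    return np.exp(-(x1**2 + x2**2) / 2) / (2 * np.pi)
```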

Actually, this is one of the key ideas of the pointwise nonparametric estimate design. We introduce the estimate as the expansion in the local neighborhood of the point x, we estimate the coefficients of this expansion, and finally we use this expansion only in order to estimate at this argument value x. Then, it follows from (6) that

ŷ_h(x) = y(x, v)|_{v=x} = C^T φ(0)  (13)

and for the polynomials (7) it yields

ŷ_h(x) = y(x, v)|_{v=x} = C1.  (14)

Let ŷ_h^{(k)}(x) be the estimator of the kth derivative y^{(k)}(x) of y(x). Here k = (k1, k2) is a multi-index with k1, k2 nonnegative integers. The LPA model (6) of the power m can be used for estimation of any derivative of the order k, |k| ≤ m. According to the idea of the pointwise estimation we derive these estimates in the form

ŷ_h^{(k)}(x) = ∂^{|k|} y(x, v)/∂v1^{k1}∂v2^{k2} |_{v=x} = (−1)^{|k|} C^T φ^{(k)}(0),
φ^{(k)}(0) = ∂^{|k|} φ(x)/∂x1^{k1}∂x2^{k2} |_{x=0}.  (15)

This definition of the derivative estimator assumes that the differentiation in (15) is done with respect to v, as the approximation y(x, v) is a function of v provided that the LPA centre x is fixed. After the differentiation we assume v = x. For the polynomials (8) the derivative estimates (15) are simple:

ŷ_h^{(1,0)}(x) = C2,  ŷ_h^{(0,1)}(x) = C3,  ŷ_h^{(2,0)}(x) = C4,
ŷ_h^{(0,2)}(x) = C5,  ŷ_h^{(1,1)}(x) = C6,  etc.  (16)

Thus, the coefficients of the LPA model (6) and (7) give the estimates of the function and of the corresponding derivatives. This link of the coefficients C with the function and derivative estimation is important for understanding the LPA.

The idea of the local approximation is applicable not only for the polynomials in the form (7) but also for different polynomials as well as for any basis functions φ which are reasonable for the local fit. In this case, the estimates of the function and the derivatives are defined by the general formulas (13) and (15), and each estimate (function and derivative) can depend on all items of the vector C. The correspondence of the function and derivative estimates with the items of the vector C shown in (14) and (16) is valid only for the polynomials (7).

It deserves to be mentioned that (15) is not the unique definition of the derivative estimate. The estimate of the derivative can be defined as the corresponding derivative of the function-estimate ŷ_h. Then

ŷ_h^{(k)}(x) = ∂^{|k|}/(∂x1^{k1}∂x2^{k2}) ŷ_h(x).  (17)

In general, the estimates (15) and (17) can be quite different; at least, the derivatives of the window function w appear in (17) while they do not appear in (15) [14,16].

3.2. Nonhomogeneous kernel estimates

According to (9) the coefficients C_k, k = 1,...,M, have to be found as a solution of the following quadratic optimization problem:

Ĉ(x, h) = arg min_{C∈R^M} J_h(x).  (18)

In this notation the dependence of the solution Ĉ(x, h) on the scale h and the variable x is emphasized. The solution is of the form

Ĉ(x, h) = Φ_h^{−1} Σ_{v∈X} w_h(x − v) φ(x − v) z(v),
Ĉ(x, h) = (Ĉ1(x, h),..., ĈM(x, h))^T,  (19)

Φ_h = Σ_{v∈X} w_h(x − v) φ(x − v) φ^T(x − v),  (20)

provided that det Φ_h ≠ 0. If the matrix Φ_h is singular, a pseudoinverse Φ_h^# can be used for the solution of (18).

Substituting Ĉ(x, h) (19) into (13) and (15) instead of C, we have the function and the derivative estimates in the kernel form:

ŷ_h(x) = Σ_{v∈X} g_h(x, v) z(v),  g_h(x, v) = w_h(x − v) φ^T(x − v) Φ_h^{−1} φ(0),  (21)

ŷ_h^{(k)}(x) = Σ_{v∈X} g_h^{(k)}(x, v) z(v),
g_h^{(k)}(x, v) = (−1)^{|k|} w_h(x − v) φ^T(x − v) Φ_h^{−1} φ^{(k)}(0),  (22)

where for the polynomials (7)

φ(0) = [1, 0,...,0]^T  (23)

is an M × 1 vector-column with only the 1st element equal to 1, and

φ^{(k)}(0) = [0,...,0, 1, 0,...,0]^T  (24)

is also an M × 1 vector-column with the only element equal to 1 corresponding to the location of the polynomial x1^{k1}x2^{k2}/(k1!k2!) in the vector φ.

Thus, the LPA estimates are presented in the form of the linear filters (21) and (22), where for the estimates ŷ_h^{(k)} and for the kernels g_h^{(k)} the subindex h indicates the dependence on this important scale parameter.

It is assumed in (2) that X is a regular grid. However, the estimates in the form (18)–(22) are quite universal. They can be applied to any data given on regular or irregular grids,

in particular, to data with lost observations and for data interpolation when the centre x of the LPA does not belong to the grid of the observations X. It is assumed in the above formulas that the summation over v ∈ X is performed within the boundaries of the image support X. This means an accurate LPA fitting inside these boundaries. There are no boundary problems for these estimates, typical for the convolution estimates considered later.

The pointwise LPA estimates as in (21) and (22) ensure the reproducing properties of the estimates with respect to the polynomial components of y, i.e., for any polynomial y_m of the power less than or equal to m the estimates of the function and the derivatives are accurate:

Σ_{v∈X} g_h(x, v) y_m(v) = y_m(x),  Σ_{v∈X} g_h^{(k)}(x, v) y_m(v) = ∂^{|k|} y_m(x)/∂x^k,  x ∈ R².  (25)

Concerning the terminology, we note that in statistics the weights g_h(x, v) and g_h^{(k)}(x, v) are named kernels and the estimates (21) and (22) are kernel estimates. The term bandwidth is used in statistics for the window size (scale) parameter h. In image processing the term mask is commonly used for the weights g_h(x, v) and g_h^{(k)}(x, v). Let the window w have a finite support, for example w(x) = 0 if ‖x‖ = √(x1² + x2²) > 1; then w_h(x) = 0 for ‖x‖ > h. Thus, the parameter h defines the window size as well as the support of the masks g_h^{(k)}(x, v) and g_h(x, v). The mask with a finite support defines a finite impulse response (FIR) linear 2D filter.

3.3. Homogeneous kernel estimates

In this paper we are mainly concerned with a special case of the LPA estimate when the kernels are homogeneous shift-invariant, depending on the difference of the arguments x and v only. Then, the estimates can be presented as convolutions of these kernels and the 2D image data. This happens if the grid X in the formulas for g_h and g_h^{(k)} is regular infinite, X = {x1 = s1Δ, x2 = s2Δ, s1, s2 ∈ Z}, where Z is the set of integers. In this case the matrix Φ_h in (21) and (22) depends only on the difference x − v and the kernels are defined by the formulas g_h(x, v) = g_h(x − v), g_h^{(k)}(x, v) = g_h^{(k)}(x − v), x, v ∈ X. Then, the estimates (21) and (22) can be represented as the convolutions:

ŷ_h(x) = Σ_v g_h(x − v) z(v) = Σ_v g_h(v) z(x − v),  (26)

ŷ_h^{(k)}(x) = Σ_v g_h^{(k)}(x − v) z(v) = Σ_v g_h^{(k)}(v) z(x − v),  (27)

with the shift-invariant kernels

g_h(x) = w_h(x) φ^T(x) Φ_h^{−1} φ(0),  (28)

g_h^{(k)}(x) = (−1)^{|k|} w_h(x) φ^T(x) Φ_h^{−1} φ^{(k)}(0),  (29)

Φ_h = Σ_v w_h(v) φ(v) φ^T(v).  (30)
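The following sketch assembles the shift-invariant smoothing kernel per (28)–(30) on a finite square grid, reusing poly_basis and gauss_window from the earlier sketch. The truncation radius is an assumption of the example (a 3h rule for the Gaussian window), and the final assertion checks the zero-order vanishing moment condition from (33) below.

```python
import numpy as np

def lpa_kernel(h, m=2, radius=None):
    """Shift-invariant LPA smoothing kernel g_h per (28)-(30) on a square grid
    (sampling interval Delta = 1), with the Gaussian window w_h(x) = w(x/h)/h^2."""
    r = radius if radius is not None else int(np.ceil(3 * h))
    v1, v2 = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    w = gauss_window(v1 / h, v2 / h) / h**2                # w_h(v), Eq. (10)
    phi = np.stack([poly_basis(a, b, m)                    # rows phi(v)^T
                    for a, b in zip(v1.ravel(), v2.ravel())])
    Phi = (w.ravel()[:, None] * phi).T @ phi               # Phi_h, Eq. (30)
    e1 = np.zeros(phi.shape[1]); e1[0] = 1.0               # phi(0), Eq. (23)
    g = w.ravel() * (phi @ np.linalg.solve(Phi, e1))       # g_h(v), Eq. (28)
    return g.reshape(v1.shape)

g = lpa_kernel(h=2.0)
assert abs(g.sum() - 1.0) < 1e-8   # zero-order moment condition in (33)
```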

Hereafter Σ_v means the double sum Σ_{s1=−∞}^{∞} Σ_{s2=−∞}^{∞} over the infinite regular 2D grid with v1 = s1Δ and v2 = s2Δ. The convolutions (26) and (27) assume that the observations z(x) defined on the finite grid (2) are completed by zeros (zero padded) outside of this finite grid for the infinite regular grid. In what follows the conventional compact notation is used for the convolutions (26) and (27):

ŷ_h(x) = (g_h ∗ z)(x),  (31)

ŷ_h^{(k)}(x) = (g_h^{(k)} ∗ z)(x).  (32)

The kernels (28) and (29) satisfy the polynomial vanishing moment conditions following from (25):

Σ_v g_h(v) v^r = δ_{|r|,0},  0 ≤ |r| ≤ m,  (33)

(1/r!) Σ_v g_h^{(k)}(v) v^r = (−1)^{|k|} δ_{k1,r1} δ_{k2,r2},  0 ≤ |k| ≤ m,  0 ≤ |r| ≤ m.  (34)

The multi-index notation means here that k = (k1, k2), r = (r1, r2), v^r = v1^{r1} v2^{r2}, |r| = r1 + r2, and 1/r! = (1/r1!)(1/r2!). The vanishing moment conditions define the polynomial smoothness of the kernels. The support of the kernels is identical to the support of the window w_h. The first condition (33) means that (26) is a smoothing operator of the order m. The second condition means that the kernel g_h^{(k)}(x) defines the differentiating operator (27) of the order m, giving the estimate of the derivative ∂^{k1+k2}/∂x1^{k1}∂x2^{k2}. Both the smoothing and the differentiating operators give accurate results for any polynomial y of the power less than or equal to the order m of the kernels.

3.4. Integral homogeneous estimates

Provided that the sampling interval Δ and the scale h are small, such that Δ, h → 0 and Δ/h → 0, the discrete convolutions (31) and (32) are transformed to the corresponding integral forms

ŷ_h(x) = (1/h²) ∫_{R²} g((x − u)/h) y(u) du = ∫_{R²} g(u) y(x − hu) du,  (35)

ŷ_h^{(k)}(x) = (1/h^{2+|k|}) ∫_{R²} g^{(k)}((x − u)/h) y(u) du = (1/h^{|k|}) ∫_{R²} g^{(k)}(u) y(x − hu) du,  x ∈ R²,  (36)

with the kernels

g(x) = w(x) φ^T(x) Φ^{−1} φ(0),  (37)

g^{(k)}(x) = (−1)^{|k|} w(x) φ^T(x) Φ^{−1} φ^{(k)}(0),  (38)

Φ = ∫_{R²} w(x) φ(x) φ^T(x) dx,  (39)

where u = (u1, u2) and ∫_{R²}(·) du = ∫_{−∞}^{+∞}∫_{−∞}^{+∞}(·) du1 du2. Some extra technical assumptions are required for the existence of the above integrals and the justification of the corresponding limit passages from the sums (31) and (32) to the integrals (35) and (36). They are satisfied, in particular, if the window w is a bounded finite support function and y(x) is continuous. For the integral kernels the vanishing moment conditions (33) and (34) are as follows:

∫_{R²} g(x) x^r dx = δ_{|r|,0},  0 ≤ |r| ≤ m,  (40)

(1/r!) ∫_{R²} g^{(k)}(x) x^r dx = (−1)^{|k|} δ_{k1,r1} δ_{k2,r2},  0 ≤ |k| ≤ m,  0 ≤ |r| ≤ m.  (41)

3.5. Restricted nonlinear LPA estimates

An 8 bit gray-scale image is defined by its intensity function taking 2⁸ = 256 different values. This intensity is nonnegative and takes values from 0 up to 255. After normalization these conditions have the form

0 ≤ y(x) ≤ 1.  (42)

Thus, y is an arbitrary nonnegative function normalized to the interval [0, 1]. These conditions can be naturally incorporated in the LPA estimate by modifying (18) to the constrained optimization:

Ĉ(x, h) = arg min_{0≤C1≤1, C̄∈R^{M−1}} J_h(x),  (43)

where C̄ = (C2,...,CM)^T is the vector C with the first item C1 excluded. The constraints (42), according to (14), can be imposed only on C1, while all other items of C continue to be unconstrained. After the estimates of C are found from (43), the intensity and the derivatives are calculated according to the usual formulas (14) and (16).

Concerning the estimate (43) a number of points can be noted. First, in general, this estimate is nonlinear with respect to the observations. Its calculation is a much more complex problem as compared with the linear estimate. However, if the linear estimate Ĉ1(x, h) found from (18) belongs to the interval [0, 1], then the solutions of (18) and (43) are identical. This gives a simple logic to deal with this nonlinear problem. We start from the calculation of the linear estimates for all x and test (42). In this way we identify the pixels

violating the conditions (42), and the nonlinear constrained estimates (43) are calculated only for these pixels. Second, if the linear estimate does not satisfy (42), the solutions of the unconstrained and constrained optimizations can differ in all elements of the vector C, not only in C1. The constraints on the intensity function can influence the estimates of the derivatives and enable one to yield both a more accurate estimate of the function as well as of the derivatives.

The following compromise is used in order to avoid the complexity of (43). The estimates are obtained in two independent steps. The first step is a solution ignoring the constraints on C1. The second step defines the estimates according to the equations:

ŷ_h(x) = [(g_h ∗ z)(x)]_+,  (44)

ŷ_h^{(k)}(x) = (g_h^{(k)} ∗ z)(x),  (45)

where [·]_+ stands for the projection on the segment [0, 1], which means [x]_+ = x for 0 < x ≤ 1, [x]_+ = 0 for x ≤ 0, and [x]_+ = 1 for x > 1. Thus, the estimate of C1 obtained from the unconstrained optimization is replaced by its projection on the interval [0, 1]. All other elements of the vector C are assumed to be equal to the corresponding items of the vector Ĉ(x, h). The estimate built in this way can be treated as an approximation of the accurate nonlinear constrained solution (43). It is a conventional practice in image processing to take into consideration the nonnegativity and the upper bound of the image intensity by the simple projection of the estimate on the interval [0, 1]. In what follows, for the sake of simplicity, we use the linear estimate given as the convolutions (26) and (27).
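In code, the projected estimate (44) is a convolution followed by a clip to [0, 1]. A minimal sketch, reusing z from the observation-model example and g from the kernel sketch above, with scipy's convolve2d used for the 2D convolution under the zero-padding convention of (31):

```python
import numpy as np
from scipy.signal import convolve2d

# Two-step estimate (44): linear convolution estimate (26), then the
# projection [.]_+ of the intensity onto [0, 1].
y_hat = np.clip(convolve2d(z, g, mode="same", boundary="fill", fillvalue=0.0),
                0.0, 1.0)
```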

4. LPA accuracy

4.1. Asymptotic bias and variance

The value of the scale h is a crucial point in the efficiency of the local estimators (e.g., [7,16,19,30]). When h is relatively small, the LPA gives a good smooth fit of functions, but then fewer data are used and the estimates are more variable and sensitive with respect to the noise. The best choice of h involves a tradeoff between the bias and the variance of the estimate. In order to clarify and formalize the meaning of this tradeoff we present some accuracy results. In what follows in this section we present the accuracy results in terms of the kth derivative estimation, as the function estimation is a special case with k = 0.

The estimation error is the difference between the true y^{(k)} and the estimate ŷ_h^{(k)}:

e_{y^{(k)}}(x, h) = y^{(k)}(x) − ŷ_h^{(k)}(x).

This error is composed of the systematic (bias) and random components corresponding to the deterministic y and the random noise ε. We have, respectively, for the bias

m_{y^{(k)}}(x, h) = y^{(k)}(x) − E{ŷ_h^{(k)}(x)},

and for the variance

σ²_{y^{(k)}}(x, h) = E{[ŷ_h^{(k)}(x) − E{ŷ_h^{(k)}(x)}]²},

where E{·} denotes the mathematical expectation calculated over ε. The asymptotic formulas for m_{y^{(k)}}(x, h) and σ²_{y^{(k)}}(x, h) can be given in the integral form with an analytical dependence on the scale h. This sort of results is basic for theoretical accuracy analysis and scale optimization.

Hypotheses assumed:

(H1) y is continuous and r-differentiable at the point x, such that y ∈ F_r(L_r) (4).
(H2) The window w is finite support and continuous.

Let us use the notation

M = min{m + 1, |r|},  (46)

where m is the order of the LPA and r is a multi-index defining the smoothness (derivative order) of y in the class F_r. Let the kernel LPA estimates be defined by (31) and (32), where |k| < M, and let the sampling interval Δ and the scale parameter h be small, such that Δ, h → 0 and Δ/h → 0. Provided that the hypotheses (H1) and (H2) hold and the derivatives y^{(r)}(x) in (4) are continuous Lipschitz functions,

|y^{(r)}(x) − y^{(r)}(y)| ≤ L‖x − y‖^γ,  γ > 0,  (47)

the bias and the variance of the estimate ŷ_h^{(k)}(x) are defined by the formulas

m_{y^{(k)}}(x, h) = (−1)^{M−1} h^{M−|k|} Σ_{|r|=M} (y^{(r)}(x)/r!) ∫_{R²} u^r g^{(k)}(u) du + o(h^{M−|k|}),  (48)

σ²_{y^{(k)}}(x, h) = (σ²Δ²/h^{2+2|k|}) ∫_{R²} (g^{(k)}(u))² du + o(Δ²/h^{2+2|k|}),  (49)

where g^{(k)} is given in (38) and the small o(x) means that o(x)/x → 0 as x → 0. The derivation of these formulas is based on quite routine techniques using, in particular, the multivariable Taylor series for y(x − hu) in h. Hypothesis (H2) and (47) enable the existence of the integral estimates (35) and (36) as well as the corresponding limits and integrals in (48) and (49).

Omitting the small terms o(·), the following inequality can be derived from (48):

|m_{y^{(k)}}(x, h)| ≤ h^{M−|k|} L_M(x) A_e^{(k)},  A_e^{(k)} = Σ_{|r|=M} (1/r!) ∫_{R²} |u^r g^{(k)}(u)| du,  (50)

where max_{|r|=M} |∂^r y(x)/∂x^r| ≤ L_M(x), according to (4), and (49) is rewritten as

σ²_{y^{(k)}}(x, h) = (σ²Δ²/h^{2+2|k|}) B_e^{(k)},  B_e^{(k)} = ∫_{R²} (g^{(k)}(u))² du.  (51)

We can see from (50) that the bias error is restricted by L_M(x), which is the absolute value of the maximum Mth order derivatives of y(x). The constant A_e^{(k)} is defined by the moments of the kernel. The scale h is an important parameter for the bias: a small h means a small bias. The upper bound of the bias is restricted by the value of h to the power M for the function estimation and to the power M − (k1 + k2) for the derivative estimation. For small h it gives h^{M−k1−k2} > h^M. Thus, the bias of the derivative estimate is asymptotically larger than the bias of the function estimate. Note also that M = min{m + 1, |r|} means that while we increase the power m of the LPA the bias becomes smaller. However, this effect is valid only until m + 1 < |r|.

The variance formula (51) shows that the variance for the derivative estimation (of the order h^{−2(1+k1+k2)}) is larger than the variance for the function estimation (of the order h^{−2}). Thus, in terms of the bias error as well as of the level of the random error, the derivative estimation is a more complex problem than the estimation of the function itself. A higher order derivative (larger k1 and k2) automatically means that the estimation errors become larger.

4.2. Scale optimization

The formulas (50) and (51) define the pointwise mean squared risk r^{(k)}(x, h) for the kth derivative estimation as

r^{(k)}(x, h) ≜ E(y^{(k)}(x) − ŷ_h^{(k)}(x))² ≤ (h^{M−|k|} L_M(x) A_e^{(k)})² + (σ²Δ²/h^{2+2|k|}) B_e^{(k)}.  (52)

The upper bound of r^{(k)}(x, h) is convex in h. Its minimization over h gives the ideal value of the scale found from the equation

∂/∂h [(h^{M−|k|} L_M(x) A_e^{(k)})² + (σ²Δ²/h^{2+2|k|}) B_e^{(k)}] = 0.

Further calculations give for this ideal scale

h_k(x) = (γ_k² σ²Δ² B_e^{(k)} / (L_M(x) A_e^{(k)})²)^{1/(2M+2)},  γ_k² = (1 + |k|)/(M − |k|),  (53)

where γ_k is the ratio of the absolute value of the bias to the standard deviation of the derivative estimate at h = h_k(x):

γ_k = |m_{y^{(k)}}(x, h_k(x))| / σ_{y^{(k)}}(x, h_k(x)).  (54)

This parameter is a constant (invariant on x) depending on M = min{m + 1, |r|} and k only. Thus, it depends on the LPA power m, the smoothness of the function defined by r, and the order k of the estimated derivative. It can be verified also that

|m_{y^{(k)}}(x, h)| < γ_k σ_{y^{(k)}}(x, h), if h < h_k(x),
|m_{y^{(k)}}(x, h)| > γ_k σ_{y^{(k)}}(x, h), if h > h_k(x).  (55)

It shows that γ_k σ_{y^{(k)}}(x, h) is a critical value for the ideal bias. For h < h_k(x) the bias is smaller, and for h > h_k(x) the bias is larger than this critical value. The ICI rule for

the adaptive varying scale selection considered in Section 4.3 is based on testing the hypothesis h ≤ h_k(x).

Let the number of observations (image samples) n and the sampling interval Δ satisfy the equation n ≅ 1/Δ², which means that the physical size of the image is 1 × 1. By substituting h_k(x) from (53) into (52) we obtain the ideal pointwise mean squared risk. If we are interested only in the order of this risk for large n, we immediately can see that

r^{(k)}(x, h_k(x)) = O(n^{−(M−|k|)/(2M+2)}).  (56)

This risk approaches zero for a large number of observations and proves the mean square convergence of the studied estimates. The convergence rate is defined by the parameter

ψ = (M − |k|)/(2M + 2) = (min{m + 1, |r|} − |k|)/(2·min{m + 1, |r|} + 2),  |k| < M.  (57)

It is seen from (57) that larger values of M result in a higher convergence rate, which is restricted by the limit value O(n^{−1/2}) for large M. The convergence rate (56) is an ideal one and cannot be achieved in practice, as it requires knowledge of the derivatives of the estimated function for the ideal varying scale selection. However, it gives useful information on the potential accuracy, which can be used as a benchmark for evaluation of algorithms.

Assuming k = 0, the corresponding results can be obtained from the above formulas for the function estimation. The ideal scale is

h_0(x) = (γ² σ²Δ² B_e^{(0)} / (L_M(x) A_e^{(0)})²)^{1/(2M+2)},  γ² = 1/M,  (58)

and the convergence rate following from (56) is defined as

r^{(0)}(x, h_0(x)) = O(n^{−M/(2M+2)}).

The convergence rate of the derivative estimates is slower than that for the function estimation.

4.3. Adaptive scale selection

The problem of the scale (window size) selection is always solved in favor of the larger scale for any parametric estimate, as this estimate is unbiased for any scale and a larger scale means a smaller variance. For the nonparametric estimation the situation is more complex. If there is no noise, the scale should be selected as small as possible, since a smaller scale means a smaller bias. However, if there is noise, the scale should be increased in order to suppress noise effects. The accuracy analysis produced in Section 4.2 confirms and illustrates this claim. It demonstrates that there is the ideal scale which defines the optimal mean squared balance between the deterministic bias errors and the variance of the random errors.

In many signal processing applications this nonparametric nature of the local approximation is ignored. It is assumed that the polynomial model is accurate, noise effects are neglected, and the window is taken of the minimum size sufficient for the polynomial fit.

For instance, we may mention most conventional differentiators as well as methods based on the fit by the orthogonal polynomials (e.g., [8,12]).

The number of publications concerning the scale selection is very large and growing quickly. While a review of the field is beyond the scope of this paper, we give a few references illustrating the basic progress in various directions.

Two approaches have been first exploited for adaptive (data-driven) invariant scale selection of the LPA estimates. The first one is based on estimation of the biasedness m_y(x, h) and the variance σ_y²(x, h) of the estimates, with the ideal scale calculation according to the theoretical formulas. However, this bias depends on the derivatives of the signal in question. Thus, in order to find the adaptive h, say from the formula (58), one needs to estimate these derivatives. This sort of methods, known as pilot estimates, are quite complex in implementation and have a few design parameters. Nevertheless, successful methods have been developed based on these ideas and reported by several authors [7,30,36,37].

The second alternative approach, also for the adaptive invariant scale selection, does not require estimation of the bias. This group of methods is based on the quality-of-fit statistics such as cross-validation, generalized cross-validation, C_p, Akaike criteria, etc., which are applied for direct optimization of the accuracy (e.g., [7,13,16,30] and references therein).

The linear LPA with the varying scale found by minimization of the so-called pseudo-mean squared error is considered in [32]. The target point is left out of the averaging in the pseudo-mean squared error, which distinguishes this estimate from the standard mean square methods. It is reported that the proposed pseudo-mean squared error works better than the local cross-validation.

A recent break-through in pointwise varying scale estimation adaptive to the unknown smoothness of the function originates from a general scheme of Lepski [28,29,38] already mentioned in the introduction. The LPA estimates are calculated for a grid of scales and compared. The adaptive scale is defined as the largest of those scales in the grid whose estimate does not differ significantly from the estimators corresponding to the smaller scales. These types of methods, first proposed in the few papers cited above for 1D nonparametric estimation, mainly differ in (1) the grid of window sizes (scales), (2) the accuracy criteria, (3) the statistics used for scale selection.

The intersection of confidence intervals (ICI) rule can be treated as quite a different implementation of the basic Lepski's approach idea [9,17]. It is proved that the LPA equipped with the ICI rule for the scale selection possesses simultaneously many attractive asymptotic properties, namely, (1) it is nearly ideal within a ln n-factor for estimation of a function (or its derivative) at a single point; (2) it is spatially adaptive in the sense that its quality is close to that which one could achieve if the smoothness of the underlying function were known in advance; (3) it is optimal in order or nearly optimal in order within a log n-factor for estimating the whole function (or its derivatives) over a wide range of classes and global loss functions [9]. These results demonstrate that the spatial adaptive abilities of these estimates are the best possible in terms of the asymptotic analysis. These results have been extended to multivariable functions, provided that the scale parameter h of the estimate is scalar [33]. Similar asymptotic accuracy results are proved for different versions of Lepski's algorithm.

Experimental study reveals that the nonasymptotic performance of the ICI rule depends essentially on the threshold parameter of the confidence intervals used in the ICI algorithm [17]. It is shown that the cross-validation is able to give adaptive values of the threshold parameter improving the estimation accuracy. Various modifications of the ICI rule have appeared to be efficient for different scale adaptive applications: median filtering [20], beamforming [21], time-frequency analysis [4,22]. An application of the ICI rule to image denoising and deconvolution has been reported in [18,19]. This development of the ICI for the 2D image intensity function exploits the 2D quadrant (symmetric and nonsymmetric) windows with adaptive varying scale parameters.

A special version of the Lepski's spatially adaptive method is proposed and analyzed in [38]. First of all, a set of test-windows is proposed which enables a fine cover of a neighborhood of the estimation point. Further, the used test-statistics are based on the residuals of estimation, while the original Lepski's algorithms use the function estimates only. The accuracy analysis produced in [38] for estimation at far and near change-points shows that the estimates are nearly optimal within the usual log n-factor unavoidable for the adaptive estimation convergence rate.

A 2D generalization of the algorithm from [38] is proposed for image denoising in [35]. It is assumed that the image intensity is an unknown piecewise constant function. The estimate is a sample mean calculated in the sliding varying adaptive size window. The main algorithmic novelty concerns the 2D window sets and the test-statistics design. For the test-statistics, the differences between the estimate in the tested window and the estimates in some subwindows of the tested window are calculated. Near optimal estimation accuracy is proved for the pixels far from and near an edge.

Basic papers concerning the Lepski's approach are published mainly in mathematical statistics journals and are concentrated on theoretical analysis of the accuracy and adaptivity properties for various classes of the functions to be estimated. Some recent results concerning a development of the adaptive scale multivariable estimation can be seen in [26], where the optimal kernels are derived for different classes of functions to estimate. To complete these introductory notes we wish to mention a new development concerning a generalization of the approach to multivariable kernel estimates, x ∈ R^d, using different scales for these variables [25]. The adaptation becomes anisotropic and assumes selection of the multivariable scale parameter h ∈ R^d. This generalization is of special interest, as there are some principal obstacles for this sort of multivariable adaptivity.

Lepski's approach

Let us start from the idea of the Lepski's approach. Introduce a set of the scales

H = {h1 < h2 < ··· < hJ},  (59)

starting from a small h1 and increasing to a maximum hJ, and let ŷ_h be the estimate of y defined for h ∈ H with the estimation standard deviation σ_y(x, h). According to the accuracy analysis produced above, for small h the estimate ŷ_h has a small bias and a large standard deviation of the random noise. The adaptive scale algorithm compares the estimates with increasing h. The main intention is to find the maximum scale when the estimate's deviation can be explained by the random component of the estimation error

and there is a balance between the biasedness and the randomness in the estimate. The accurate meaning of this sort of balance is discussed above.

The Lepski's approach defines the adaptive scales according to the conditions

ĥ(x) = max{h ∈ H: |ŷ_h(x) − ŷ_η(x)| ≤ T(h, η, x) for all η < h, η, h ∈ H},  (60)

where T(h, η, x) is a given threshold. The procedure (60) is looking for the largest scale in order to obtain the maximum smoothing effect for the random errors. However, a large h can result in a significant bias error. All estimates ŷ_η(x) with the scale η < h are compared with the estimate ŷ_h(x) of the scale h. If the differences ŷ_h(x) − ŷ_η(x) can be explained by the random errors, the bias is not large and a larger h can be selected. The adaptive scale ĥ is defined as the maximum h in H such that all estimates ŷ_η(x) with η < h are not too different from ŷ_h(x). The multiple comparison of the estimates with the different scales is used. The parameter T(h, η, x) is a key element of the algorithm, as it says when the difference between the estimates is large or small. The procedure (60) enables a multiple statistical test on the significance of the systematic error in the differences ŷ_h(x) − ŷ_η(x) in comparison with the corresponding random errors. A variety of the Lepski's algorithms is defined mainly by the different forms of the threshold T(h, η, x), which usually depends on the variances of the estimates with the scales h and η. A proper selection of this threshold enables nice statistical properties of these adaptive scale estimates [27–29,38].

Let us describe two different algorithms of the Lepski's class in order to illustrate the approach overall and in order to show that these algorithms are indeed simple in implementation. We also use these algorithms for a reference and for further presentation of the novel MR local regression.

Lepski–Spokoiny algorithm [29]

The adaptive scale h⁺(x) is defined as follows:

i⁺ = max{i: |ŷ_{h_i}(x) − ŷ_{h_j}(x)| ≤ Γ1(h_j)·σ_y(x, h_j), ∀ h_j < h_i, 1 ≤ i ≤ J},
h⁺(x) = h_{i⁺},  (61)

with the adaptive scale estimate

ŷ_{h⁺}(x) = ŷ_{h⁺(x)}(x),  h⁺(x) = h_{i⁺}.  (62)

Here ŷ_{h_i}(x) is compared with all estimates having h_j < h_i, and in this comparison the inequality in (61) is tested. As in (60), the adaptive scale h⁺(x) is equal to the maximum h_{i⁺} ∈ H satisfying all of the corresponding inequalities in (61). In this case the threshold T(h, η, x) from (60) is used in the form T(h_i, h_j, x), as according to (61) h_j < h_i and T(h_i, h_j, x) = Γ1(h_j)·σ_y(x, h_j).

The grid H for (61) is defined inductively, starting from the largest hJ, by

h_{J−k} = h_{J−k+1}/(1 + α(h_{J−k+1})),  k = 1, 2,...,J − 1,  (63)

d(h) = max(1, r·lg(h_J/h)),  α(h) = 1/d(h).  (64)

The threshold Γ1(h_j) in (61), depending on h, is as follows:

Γ1(h) = (1 + α(h))·d(h).  (65)

It is proved in [29], under some asymptotic considerations, that the algorithm gives the adaptive window sizes which minimize the risk E{|ŷ(x, h_i) − y(x)|^r}, r ≥ 1. The r is a parameter used in (64). Note that the total number of compared scales is of the logarithmic order and depends on the maximum hJ.
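A sketch of the selection rule (61) at a single pixel. The inputs (the per-scale estimates, their standard deviations, and the thresholds Γ1(h_j)) are assumed precomputed; scanning until the first violation is one common way to realize the maximum in (61).

```python
def lepski_scale(y_hat, sigma_y, gammas):
    """Rule (61) at one pixel: y_hat[i] is the estimate for scale h_i
    (ascending scales), sigma_y[i] its standard deviation, gammas[i] the
    threshold Gamma_1(h_i). Returns the index i+ of the adaptive scale."""
    i_plus = 0
    for i in range(len(y_hat)):
        # test |yhat_i - yhat_j| <= Gamma_1(h_j) * sigma_y(x, h_j) for all j < i
        if all(abs(y_hat[i] - y_hat[j]) <= gammas[j] * sigma_y[j]
               for j in range(i)):
            i_plus = i
        else:
            break
    return i_plus
```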

ICI algorithm [9,17,19]

Though from the Lepski's class, this algorithm is derived from different speculations and has quite a different recursive structure. Determine a sequence of the confidence intervals Q_j of the estimates ŷ_{h_j}(x):

Q_j = [ŷ_{h_j}(x) − Γ·σ_y(x, h_j), ŷ_{h_j}(x) + Γ·σ_y(x, h_j)],  (66)

where Γ is a threshold parameter. Consider the intersection of the intervals Q_j, 1 ≤ j ≤ i, with increasing i, and let i⁺ be the largest of those i for which the intervals Q_j have a point in common. This i⁺ defines the adaptive scale and the adaptive LPA estimate as given by (62).

The following algorithm implements the ICI rule. Determine the sequence of the upper and lower bounds of the confidence intervals Q_j = [L_j, U_j] as follows:

U_j = ŷ_{h_j}(x) + Γ·σ_y(x, h_j),  L_j = ŷ_{h_j}(x) − Γ·σ_y(x, h_j),  (67)

L̄_{j+1} = max{L̄_j, L_{j+1}},  Ū_{j+1} = min{Ū_j, U_{j+1}},  j = 1, 2,...,J,
L̄_1 = L_1,  Ū_1 = U_1.  (68)

According to these formulas, L̄_{j+1} is a nondecreasing sequence and Ū_{j+1} is a nonincreasing sequence. Find the largest j for which

L̄_j ≤ Ū_j,  j = 1, 2,...,J,  (69)

is still satisfied. Denote this largest value as i⁺. This i⁺ is the largest of those j for which the confidence intervals Q_j have a point in common, as discussed above, and the ICI adaptive scale is h⁺ = h_{i⁺}. It is a procedure for a fixed x giving the varying adaptive scale h⁺(x). Figure 1 illustrates this algorithm.

In the ICI algorithm the estimates of the different scales are compared by using their confidence intervals. We may conclude that the confidence intervals Q_i and Q_j intersect if and only if

|ŷ_{h_i}(x) − ŷ_{h_j}(x)| ≤ Γ(σ_y(x, h_i) + σ_y(x, h_j)).
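A sketch of the recursion (67)–(69) at a single pixel, with the same assumed inputs as above; numpy arrays hold the per-scale estimates and standard deviations.

```python
import numpy as np

def ici_scale(y_hat, sigma_y, gamma):
    """ICI rule (66)-(69) at one pixel for ascending scales h_1 < ... < h_J.
    Returns the index i+ of the largest scale whose confidence interval still
    has a point in common with all the previous ones."""
    L = y_hat - gamma * sigma_y           # lower bounds, Eq. (67)
    U = y_hat + gamma * sigma_y           # upper bounds, Eq. (67)
    lo, hi = -np.inf, np.inf
    i_plus = 0
    for j in range(len(y_hat)):
        lo, hi = max(lo, L[j]), min(hi, U[j])   # running bounds, Eq. (68)
        if lo > hi:                             # intersection empty, Eq. (69)
            break
        i_plus = j
    return i_plus
```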

Fig. 1. Intersection of confidence intervals (ICI) rule for the adaptive varying scale selection. The confidence intervals are shown for h_j ∈ H. The vertical lines with arrows show the successive intersections of the confidence intervals (1, 2), (1, 2, 3) and (1, 2, 3, 4). Assuming that the intersection with the fourth confidence interval (corresponding to h = h4) is the last nonempty one, we obtain the adaptive scale as equal to h4.

Then the ICI rule is reduced to Lepski's scheme (60) with

T(h, η, x) = Γ(σ_y(x, h) + σ_y(x, η)).

The theoretical analysis produced in [9] for 1D regression shows that the ICI adaptive scale estimate gives the best possible pointwise and global MSE. A generalization of this result for a multidimensional regression with the scalar h is done in [33].

In the asymptotic analysis most of the scale adaptive nonparametric regression algorithms are equivalent in terms of the convergence rate. However, simulation shows that in practice the efficiency of the algorithms can be quite different. It deserves to be mentioned that similar asymptotic properties concerning the convergence rate as well as the classes of adaptivity are known for the wavelet techniques.

Recall that the ideal scales (53) balance the bias-variance tradeoff. This balance depends on L_M(x), i.e., on the derivatives of the order M, and these derivatives are unknown in advance. The order of these derivatives, M = min(m + 1, |r|), depends on the parameter r, which also is unknown. The ICI rule gives the adaptive scales close to the ideal ones. The confidence intervals Q_j (67) used in the ICI depend on the estimates and the standard deviations σ_y(x, h_j) (49) only, and do not use the parameter r or the derivatives of y. Thus, the ICI rule produces estimates which are spatially adaptive to the unknown varying smoothness of the estimated signal y.

5. MR nonparametric regression

5.1. Nonparametric regression spectrum

Let us introduce a finite set of scales

H̄ = {h0 > h1 > ··· > hJ},  (70)

starting from a largest h0 and decreasing to a smallest hJ. Thus, H̄ is a set of descending scales, while H in (59) is a set of ascending scales. We consider the convolution image intensity estimates as defined by (31) and assume that for the smallest scale hJ the LPA kernel g_h(u) is the identity operator:

lim_{h→hJ} g_h(x) = δ_{x,0}.  (71)

This assumption is not restrictive and only defines a range of scales starting from a large h0 and going to a sufficiently small hJ. For instance, if the window in (28) is an indicator such that w(x) = 1 for |x1| < 1, |x2| < 1, then the LPA with m = 0 ensures that g_{hJ}(x) = δ_{x,0} for hJ = 1. If the window w is the 2D standard Gaussian density, then h in w(x/h)/h² (28) is the standard deviation of this distribution and the LPA defines the discrete kernel such that lim_{h→0} g_h(x) = δ_{x,0}. In this case hJ → 0. Further, we assume that all kernels g_h, h ∈ H̄, have finite supports embedded into a finite rectangular regular grid U. If the support of the kernel is smaller than U, the kernel is completed by zeros in order to have g_h(x), h ∈ H̄, defined for all x ∈ U.

Let us start from a simple decomposition of the function estimate ŷ_h (31) into a sum of differences of various scale estimates:

Ŷ(x, β) = ŷ_{h0}(x) + Σ_{j=1}^{J} β_j Δŷ_{hj}(x),  Δŷ_{hj}(x) = (Δg_{hj} ∗ z)(x),  (72)

Δg_{hj}(x) = g_{hj}(x) − g_{h_{j−1}}(x) for j = 1, 2,...,J,  (73)

where Δŷ_{hj}(x) = ŷ_{hj}(x) − ŷ_{h_{j−1}}(x) and β = (β1,...,βJ)^T is a vector of coefficients.

Consider the items in the right hand-side of (72). The difference Δŷ_{hj} is the deviation of the estimate caused by a decrement of h from h_{j−1} to h_j. The largest value h = h0 means the coarsest scale and a strong smoothing kernel, with ŷ_{h0}(x) = (g_{h0} ∗ z)(x) being the smoothest estimate giving the low frequency picture of y. Smaller h corresponds to a finer scale detecting higher frequency details in the image. In the sum (72) the first term ŷ_{h0} presents the smoothed background of the image while the others add details of different scales. In this way (72) is a decomposition of the image y into the different scale components. We call Δg_{hj} a spectral (scale) analysis kernel and Ŷ(x, β) a spectral (scale) decomposition.

The following properties are easy to verify.

(1) According to (33) the analysis kernels Δg_{hj}(x), j = 1,...,J, have vanishing moments up to the order m:

Σ_{x∈U} Δg_{hj}(x) x^k = 0,  0 ≤ |k| ≤ m,  j = 1,...,J.  (74)

Thus, the analysis kernels have the polynomial smoothness defined by the power m of the LPA.

(2) The sum of the analysis kernels Δg_{hj}, assuming Δg_{h0} = g_{h0}, is the identity operator:

Σ_{j=0}^{J} Δg_{hj}(x) = g_{hJ}(x) = δ_{x,0}.  (75)

(3) For any h ∈ H̄ the estimate ŷ_h can be represented in the form

ŷ_h(x) = Ŷ(x, β(h)),  (76)

where the coefficients β_j(h) in (72) are defined by the indicator function

β_j(h) = 1[h ≤ h_j].  (77)

(4) For β_j = 1 for 1 ≤ j ≤ J, a perfect reconstruction of y takes place:

ŷ_{h0}(x) + Σ_{j=1}^{J} Δŷ_{hj}(x) = y(x).  (78)

Equation (76) is verified by substituting β_j given by (77) in (72). Thus, a varying h in (77) controls the number of spectral items in the expansion (72), and in this way it varies Ŷ(x, β(h)) from the perfect reconstruction of the observed y(x) = Ŷ(x, β(hJ)) to the most smoothed estimate ŷ_{h0}(x) = Ŷ(x, β(h0)). The problem of the adaptive scale selection for the estimate ŷ_h(x) can be formulated as the selection of h in β(h) (77) for the estimate (72). For h invariant on x, (77) gives the same scale selection for all x, while h dependent on x results in the pointwise varying scale

β_j(h(x)) = 1[h(x) ≤ h_j].  (79)

5.2. Multiresolution analysis

Developing further the concept of the local regression spectrum, we assume that the coefficients β_j in (72) may be not binary. Then we arrive at the idea of nonparametric estimation of y based on the spectral decomposition (72) with some estimates of the spectral coefficients β_j. In this way we break with the traditional statistical approach to local regression, assuming that h in ŷ_h is the only scale parameter defining the estimate.

In order to make this approach more constructive we replace the initial spectral analysis kernels Δg_{hj} by their orthogonal counterparts. Let the kernels Δg_{hj}(x), j = 0, 1,...,J, x ∈ U, be a set of (J + 1) linearly independent functions with bounded Euclidean norms, ‖Δg_{hj}‖² = Σ_{x∈U} (Δg_{hj}(x))² < ∞. Then the standard Gram–Schmidt procedure gives ω_{j+1}(x) orthogonal with respect to ω_k(x), 0 ≤ k ≤ j, for x ∈ U as follows:

ω_{j+1}(x) = Δg_{h_{j+1}}(x) − Σ_{k=0}^{j} (⟨Δg_{h_{j+1}}, ω_k⟩/‖ω_k‖²) ω_k(x),
ω_0(x) = Δg_{h0},  j = 0,...,J − 1,  (80)

where the inner product ⟨·,·⟩ means ⟨Δg_h, ω_k⟩ = Σ_{x∈U} Δg_h(x) ω_k(x) and ‖ω_k‖² = ⟨ω_k, ω_k⟩ = Σ_{x∈U} ω_k²(x). Replace these ω_j by the normalized kernels ω_j/‖ω_j‖. Then the orthonormal vector-functions ω(x) = (ω0(x), ω1(x),..., ωJ(x))^T and Δg(x) = (Δg_{h0}(x), Δg_{h1}(x),..., Δg_{hJ}(x))^T satisfy the Gram–Schmidt equation

Q·ω(x) = Δg(x) for all x ∈ U,  (81)

where Q = (Q_{j,k})_{j,k=0,1,...,J} is a nonsingular lower triangular (J + 1) × (J + 1) matrix.
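The constructions (73) and (80)–(81) in code, reusing lpa_kernel from the earlier sketch. The scale set, the common support radius, and the finest scale (chosen small so that g_{hJ} is close to the identity kernel (71)) are assumptions of the example.

```python
import numpy as np

# Difference kernels (73) on the common support U (the fixed radius acts as
# the zero-padding to U); the descending scale set is an arbitrary example.
scales = [8.0, 4.0, 2.0, 1.0, 0.5]                 # h_0 > h_1 > ... > h_J
R = int(np.ceil(3 * scales[0]))
g_list = [lpa_kernel(h, m=2, radius=R) for h in scales]
dg = [g_list[0]] + [g_list[j] - g_list[j - 1] for j in range(1, len(g_list))]

def mr_analysis_kernels(diff_kernels):
    """Gram-Schmidt (80)-(81): returns the orthonormal analysis kernels
    omega_j (rows of W, flattened on U) and the lower triangular matrix Q
    such that Q @ W = G, the stacked difference kernels."""
    G = np.stack([k.ravel() for k in diff_kernels])
    W = np.zeros_like(G)
    Q = np.zeros((G.shape[0], G.shape[0]))
    for j in range(G.shape[0]):
        v = G[j].copy()
        for k in range(j):
            Q[j, k] = G[j] @ W[k]      # <Delta-g_j, omega_k>, W[k] has unit norm
            v -= Q[j, k] * W[k]
        Q[j, j] = np.linalg.norm(v)
        W[j] = v / Q[j, j]
    return W, Q

W, Q = mr_analysis_kernels(dg)
```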

Let α_j be the output of the filter with the kernel ω_j when the input is the accurate signal y. Then

α_j(x) = (ω_j ∗ y)(x)  (82)

and the following can be verified:

(1) The outputs α_j define a spectral analysis with components varying from a low frequency (coarse scale) base image α_0 to higher frequency (finer scale) image increments α_j. A higher value of j corresponds to a higher frequency spectral component of the spectrum.

(2) The spectral kernels have vanishing moments up to the order m:

(ω_j ∗ x^k)(0) = 0,  0 < |k| ≤ m,  j = 1,...,J.  (83)

Note that contrary to (74), in general, (ω_j ∗ x^k)(0) ≠ 0 for |k| = 0.

Let l₂(Z²) be the space of square summable 2D functions y defined on the infinite regular grid X, i.e., y ∈ l₂(Z²) if Σ_{s∈Z²} y²(sΔ) < ∞. Introduce the accumulated kernels

Ω_j(x) = Σ_{l=0}^{j} ω_l(x),  j = 0, 1,...,J.  (84)

Define linear spaces W_j and V_j generated by the kernels ω_j and Ω_j, respectively:

W_j = {y̆_j(x) = Σ_{u∈U} ω_j(u) y(x − u): y ∈ l₂(Z²)},  j = 1,...,J,  (85)

V_j = {ȳ_j(x) = Σ_{u∈U} Ω_j(u) y(x − u): y ∈ l₂(Z²)},  j = 0, 1,...,J.  (86)

It can be verified that these y̆_j ∈ l₂(Z²) and ȳ_j ∈ l₂(Z²) for any y ∈ l₂(Z²). Thus, W_j and V_j are subspaces of l₂(Z²) defined as convolutions of y ∈ l₂(Z²) with the kernels ω_j and Ω_j, respectively. The kernels ω_j are orthogonal. However, it does not mean that the subspaces W_j are orthogonal also. It follows from (84) that the subspaces V_{j+1} can be represented in the form

V_{j+1} = V_j + W_{j+1},  j = 0, 1,...,J − 1,  (87)

where the plus + stands for the sum of two subspaces. The subspace W_{j+1} is a complement (nonorthogonal in general) of the subspace V_j. It follows that

V_{j+1} = V_0 + (Σ_{k=1}^{j+1} W_k),  j = 0,...,J − 1.

One requirement on the sequence of subspaces V_j in the MR analysis is completeness: ȳ_j(x) → y(x) as j → J. The sequence V_j defined in (86) is complete because of (71):

V_J = l₂(Z²).  (88)

Definition. The sequence of spaces {V_j, j = 0, 1,...,J} generated by ω_j, j = 0, 1,...,J, is an MR analysis of y ∈ l₂(Z²), and the ω_j are MR analysis kernels defined on the scale set H̄.

The following is a nonparametric local regression decomposition of y based on the analysis kernels ω_j.

Proposition. Let y ∈ l₂(Z²) and q_j = Σ_{l=j}^{J} Q_{l,j}, where Q_{l,j} are the elements of the matrix Q in (81); then

y(x) = Σ_{j=0}^{J} α_j(x) q_j,  (89)

α_j(x) = (ω_j ∗ y)(x).  (90)

Proof of Proposition. Substituting (90) in (89) transforms this representation into the convolution

y(x) = (K ∗ y)(x),  (91)

K(x) = Σ_{j=0}^{J} ω_j(x) q_j.  (92)

We show that K(x) = δ_{x,0} and in this way prove that (91) is the identity. Indeed, it follows from (81) that

Σ_{l=0}^{J} Δg_{hl}(x) = Σ_{l=0}^{J} Σ_{j=0}^{l} Q_{l,j} ω_j(x) = Σ_{j=0}^{J} ω_j(x) Σ_{l=j}^{J} Q_{l,j} = Σ_{j=0}^{J} ω_j(x) q_j = K(x).

However, according to (75), Σ_{l=0}^{J} Δg_{hl}(x) = g_{hJ}(x) = δ_{x,0}. It shows that K(x) = δ_{x,0} and completes the proof of the proposition.
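Continuing the sketch above, the synthesis weights q_j of the Proposition can be read off the matrix Q. The final check verifies that K in (92) telescopes to the finest-scale kernel g_{hJ}, which approaches the delta kernel (71) as the finest scale shrinks.

```python
import numpy as np

# Synthesis weights q_j = sum_{l >= j} Q[l, j] from the Proposition.
q = np.array([Q[j:, j].sum() for j in range(Q.shape[0])])

# K(x) in (92); by (75) it equals the telescoped sum of the difference
# kernels, i.e. the finest-scale kernel g_{h_J} (flattened).
K = q @ W
assert np.allclose(K, sum(k.ravel() for k in dg))
```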

The formulas (90) and (89) yield the accurate MR spectral expansion of y valid for any nonparametric regression y ∈ l₂. Specifically, the equations (90) and (89) define, respectively, the analysis and synthesis steps of this MR analysis. The synthesis formula (89) shows that the projections α_j of y on the MR analysis spaces V_j, taken with the weights q_j, enable the perfect reconstruction of any y ∈ l₂(Z²).

We wish to mention that the synthesis formula (89) is not unique. In particular, the synthesis of the form

y(x) = Σ_{j=0}^{J} α_j(x) ω_j(0),  α_j(x) = (ω_j ∗ y)(x),  (93)

is studied in [23,24].

If g_{hJ}(x) ≠ δ_{x,0}, the formula (89) can be used as an approximate synthesis formula which gives the reconstruction of y(x) within the bias error of the kernel operator g_{hJ}. This bias error is analyzed in Section 4. Note that the following generalization of (89) is valid:

ŷ_{hi}(x) = Σ_{j=0}^{i} α_j(x) q_j(i),  q_j(i) = Σ_{l=j}^{i} Q_{l,j},  (94)

where ŷ_{hi} is the LPA estimate of the scale h_i.

5.3. MR for differentiation

For smoothing, the accurate identity operator is assumed for the scale h_J in the perfect reconstruction formula (89). This sort of accurate differentiation is not possible, as there is no kernel for the discrete convolution (32) which gives the accurate derivative for any y. The approximate form of (89) is appropriate for differentiation of nonparametric regression functions.

Let us replace g_h in (72) and (73) by the differentiation kernels g_h^{(k)} (29). Then the Gram–Schmidt formulas (80) define the orthogonal differentiation analysis kernels ω_j^{(k)} and the corresponding MR differentiation subspaces. The analysis and synthesis formulas (89) and (90) are changed to the form:

ŷ^{(k)}(x) = Σ_{j=0}^{J} α_j^{(k)}(x) q_j^{(k)},  α_j^{(k)}(x) = (ω_j^{(k)} ∗ y)(x).  (95)

Here q_j^{(k)} = Σ_{l=j}^{J} Q_{l,j}^{(k)}, with Q_{l,j}^{(k)} being the elements of the matrix Q^{(k)} in the Gram–Schmidt equation

Q^{(k)} ω^{(k)}(x) = Δg^{(k)}(x),  x ∈ U,  (96)

where

ω^{(k)} = (ω_0^{(k)}(x), ω_1^{(k)}(x),..., ω_J^{(k)}(x))^T,
Δg^{(k)} = (Δg_0^{(k)}(x), Δg_1^{(k)}(x),..., Δg_J^{(k)}(x))^T,  Δg_0^{(k)}(x) = g_{h0}^{(k)}(x).

The $\omega_J^{(k)}$ is the finest-scale differentiation analysis operator in the representation (95). The interval of the scales from $h_0$ to $h_J$, as specified in $H$ and used in (95), defines the derivative estimates of different scales, to select from or to use jointly in the combined MR estimation. Note that $\omega_J^{(k)}$ is not the accurate differentiating operator. The formula (95) therefore defines an approximate reconstruction of the derivative, within the accuracy corresponding to the derivative estimate with the kernel $\omega_J^{(k)}$ having the finest scale in the set $H$, $h=h_J$.

Examples of kernels

Some examples of the MR analysis smoothing kernels $\omega_j$ are shown in Fig. 2. These kernels are obtained according to (80), where the $g_h$ are derived from the LPA of the power $m=2$ for the 2D Gaussian window $w(x)=\frac{1}{2\pi}\exp(-\|x\|^2)$ truncated to the squares of the size $h\times h$, where $h\in H$ and $H=\{21,11,5,3,1\}$. The first MR kernel $\omega_0$ defines a smoothing low-pass filter, while the other analysis kernels $\omega_j$, $j=1,\dots,4$, define band-pass filters.

The MR differentiating kernels $\omega_j^{(1,0)}(x)$ are shown in Fig. 3. These kernels are obtained from $g_j^{(1,0)}(x)$ (differentiation on $x_1$), derived by using the LPA of the power $m=2$ and the 2D Gaussian window $w(x)=\frac{1}{2\pi\sigma^2}\exp(-\|x\|^2/\sigma^2)$, $\sigma=0.5$. These kernels are truncated to the squares of the size $h\times h$, $h\in H$, with $H=\{21,15,11,5,3\}$.

Fig. 2. MR analysis kernels $\omega_j(x)$ obtained using the LPA of the power $m=2$ and the 2D Gaussian window $w(x)=\frac{1}{2\pi}\exp(-\|x\|^2)$.
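The LPA kernels $g_h$ themselves are defined earlier in the paper (Eq. (28), outside this excerpt). As a hedge, the sketch below builds the textbook weighted least-squares form of such a kernel, $g_h(u)=w_h(u)\,\phi(u)^T\Phi^{-1}\phi(0)$, which reproduces polynomials up to degree $m$; the paper's exact construction may differ in details such as normalization:

```python
import numpy as np

def lpa_kernel(h, m=2, sigma=None):
    """Textbook 2D LPA smoothing kernel on an h-by-h square support.

    Uses a uniform window if sigma is None, otherwise a truncated Gaussian
    window. For very small supports the moment matrix can be singular (the
    kernel degenerates towards the delta), so the pseudo-inverse is used.
    """
    r = h // 2
    x1, x2 = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    if sigma is None:
        w = np.ones_like(x1, dtype=float)
    else:
        w = np.exp(-(x1**2 + x2**2) / sigma**2) / (2 * np.pi * sigma**2)
    # monomial basis of total degree <= m: 1, x1, x2, x1^2, x1*x2, x2^2, ...
    powers = [(p, q) for p in range(m + 1) for q in range(m + 1 - p)]
    phi = np.stack([(x1**p) * (x2**q) for p, q in powers]).astype(float)
    Phi = np.einsum('ixy,jxy,xy->ij', phi, phi, w)   # weighted moment matrix
    e0 = np.array([1.0 if pq == (0, 0) else 0.0 for pq in powers])  # phi(0)
    coeff = np.linalg.pinv(Phi) @ e0                 # Phi^{-1} phi(0)
    return w * np.tensordot(coeff, phi, axes=1)      # g_h(u)
```

By construction $\sum_u g_h(u)P(u)=P(0)$ for any polynomial $P$ of degree at most $m$, which is the polynomial reproducing property the paper relies on.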

Fig. 3. MR differentiating kernels $\omega_j^{(1,0)}(x)$ obtained using the LPA of the power $m=2$ and the 2D Gaussian window $w(x)=\frac{1}{2\pi\sigma^2}\exp(-\|x\|^2/\sigma^2)$, $\sigma=0.5$.

The first kernel, $\omega_0^{(1,0)}$, is the MR differentiation kernel of the largest scale $h_0=21$. The higher-scale kernels $\omega_j^{(1,0)}$, $j>0$, become narrower. A specific feature of the differentiation kernels is that all of them have zero value at the central pixel of the square mask, while the smoothing kernels $\omega_j(x)$ have their maximum peaks at the central pixel. The support of a square-mask differentiating kernel should be larger than $1\times 1$ in order that the LPA fitting with $m\geqslant 1$ be possible. For the minimum-scale differentiating kernel we use the $3\times 3$ mask, $h=3$. It is the minimum in the set $H$ used for differentiating.

6. Filtering by thresholding

A common underlying assumption in multiscale MR curve/surface/signal estimation is that the function to estimate has some redundancy. This is often reflected by the hypothesis that it belongs to a particular functional class. For example, it could be discontinuous but only at a limited number of places, or the function is assumed to have only one mode or to be monotone. Then the heuristic for the use of, say, wavelets is that the expansion of such a function in a wavelet basis is sparse, i.e., only a few of the wavelet coefficients are large and the rest are small and thus negligible. Hence, in order to estimate the function, one has

to estimate the large wavelet coefficients and discard the rest. This approach has proved useful and successful, as shown in recent years by various authors (see Refs. [11,31] and references therein). In what follows we apply the thresholding technique to the MR local regression spectral components.

Let the image observation be given by the noisy model (1) and the analysis kernels $\omega_j$ be applied to these data. Then the noisy observations

$$\hat a_j(x)=(\omega_j\ast z)(x)\qquad(97)$$

of the true spectral coefficients $\alpha_j(x)=(\omega_j\ast y)(x)$ are linked with them by the equation

$$\hat a_j(x)=\alpha_j(x)+\sigma n_j(x),\quad j=0,1,\dots,J,\qquad(98)$$

where $n_j=(\varepsilon\ast\omega_j)(x)$ are zero-mean Gaussian noises with standard deviation equal to 1. The orthogonality of the analysis kernels $\omega_j(x)$ means that these noises are uncorrelated for different scales $j$ and a fixed $x$.

The goal is to estimate the unknown vector $\alpha(x)=(\alpha_0(x),\alpha_1(x),\dots,\alpha_J(x))^T$ from the observations (98). When these estimates of $\alpha_j(x)$ are found, the function estimate can be formed as in (89), with the true $\alpha_j(x)$ replaced by the estimates. This nonparametric estimation via the MR spectral decomposition is divided into two steps. The first step transforms the data into noisy versions $\hat\alpha_j(x)$ of the spectral coefficients. In the second step, these estimates of the spectral coefficients are filtered using the heuristic, confirmed by simulation, that the spectral MR representation of the signal is sparse and that the noise is evenly spread over the empirical spectral coefficients.

Since the spectral MR representation usually is sparse, it is expected that only a small fraction of the spectral coefficients is large and that the rest are small and thus negligible. So if a spectral coefficient is small, it is reasonable to regard it as mostly noise and to set it to zero; if it is large, it is reasonable to keep it. This is known as hard-thresholding. A soft-thresholding shrinks everything towards zero by a certain amount, thus reducing the variance of the estimation at the cost of a higher bias.

If the noisy $\hat a_j(x)$ (98) are substituted in (89) instead of $\alpha_j(x)$, then $\hat y(x)=z(x)$, i.e., there is no filtering, because the formula (89) gives a perfect reconstruction of any input signal $z(x)$. The mean square error of this straightforward estimate is given by the formula

$$E\Big\{\Big[\sum_{j=0}^{J}q_j\big(\alpha_j(x)-\hat a_j(x)\big)\Big]^2\Big\}=\sum_{j=0}^{J}q_j^2\,E\big\{\big(\alpha_j(x)-\hat a_j(x)\big)^2\big\},\qquad(99)$$

as the $\hat a_j(x)$ are uncorrelated for different $j$. The additive structure of (99), with independent contributions by the estimates of the different scales, shows that diagonal estimation, i.e., independent estimation of $\alpha_j(x)$ for different $j$, is a reasonable idea. Assume that this diagonal estimator $\tilde\alpha_j(x)$ of $\alpha_j(x)$ has a linear structure [31]:

$$\tilde\alpha_j(x)=\gamma_j\,\hat a_j(x),\qquad(100)$$

where $0\leqslant\gamma_j\leqslant 1$ is an attenuation factor of the estimate. Then the estimate (89) takes the form

$$\hat y(x)=\sum_{j=0}^{J}\tilde\alpha_j(x)\,q_j=\sum_{j=0}^{J}\gamma_j\,\hat a_j(x)\,q_j.\qquad(101)$$
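The analysis step (97) is a plain convolution of the noisy image with each kernel in the bank; a minimal sketch follows (the symmetric boundary handling is an implementation choice, not the paper's specification):

```python
from scipy.signal import convolve2d

def mr_analysis(z, omega_list):
    """Analysis step, Eq. (97): the noisy MR spectral coefficients
    a_hat_j(x), obtained by convolving the observation z with each
    analysis kernel omega_j."""
    return [convolve2d(z, w, mode='same', boundary='symm')
            for w in omega_list]
```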

Let us start from the oracle linear thresholding, assuming that $\alpha_j(x)$ is known. It gives the ideal attenuation factor $\gamma_j=\alpha_j(x)^2/(\alpha_j(x)^2+\sigma^2)$ [31]. The oracle estimate cannot be realized from the data, since it depends on the unknown $\alpha_j(x)$. However, this estimate is useful, in particular, as a benchmark for real estimates.

We consider four thresholding algorithms applied to the observations (98), which define nonlinear estimates $\tilde\alpha_j(x)$ of $\alpha_j(x)$ in the model (100) by selection of the attenuation factor $\gamma_j$ as a nonlinear function of $\hat a_j(x)$ (e.g., [3,31]):

(a) Hard-thresholding:
$$\gamma_j(x)=\mathbf{1}\big(|\hat a_j(x)|>t\sigma\big).\qquad(102)$$
Here and in what follows $t>0$.

(b) Soft-thresholding:
$$\gamma_j(x)=\big(1-t\sigma/|\hat a_j(x)|\big)_+,\qquad(103)$$
where $(a)_+=a$ if $a>0$ and $(a)_+=0$ otherwise.

(c) Stein's thresholding:
$$\gamma_j(x)=\big(1-(t\sigma)^2/\hat a_j(x)^2\big)_+.\qquad(104)$$

(d) Smoothed Stein's thresholding:
$$\gamma_j(x)=\big(1-(t\sigma)^2/\hat b_j(x)^2\big)_+,\qquad(105)$$
where $\hat b_j(x)^2$ is the mean value of $\hat a_j(x)^2$ calculated in a square $M_1\times M_1$ neighborhood of the pixel $x$.

It follows from the MR representation for differentiation (95) that the adaptive scale derivative estimates can be given by the formulas (100)-(105), provided that the $\hat a_j(x)$ are replaced by the corresponding $\hat a_j^{(k)}(x)=(\omega_j^{(k)}\ast z)(x)$, where the $\omega_j^{(k)}$ are defined as in (96).

In this section we exploit a few well-known and efficient diagonal thresholding methods, while there are many interesting alternatives (e.g., [3,5,6,11,31]). Thresholding overall allows an interesting interpretation in the context of the sequence estimation or model selection framework [1].
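A sketch of the four attenuation rules as array operations; the squared form of the Stein rules follows the reading of (104)-(105) above, and the small epsilon guards against division by zero (an implementation detail, not part of the paper's formulas):

```python
import numpy as np
from scipy.ndimage import uniform_filter

EPS = 1e-12  # numerical guard, not part of the formulas

def gamma_hard(a_hat, t, sigma):
    """Hard-thresholding (102): keep a coefficient iff |a_hat| > t*sigma."""
    return (np.abs(a_hat) > t * sigma).astype(float)

def gamma_soft(a_hat, t, sigma):
    """Soft-thresholding (103): shrink towards zero by t*sigma."""
    return np.maximum(1.0 - t * sigma / np.maximum(np.abs(a_hat), EPS), 0.0)

def gamma_stein(a_hat, t, sigma):
    """Stein's thresholding (104): shrink by the squared ratio."""
    return np.maximum(1.0 - (t * sigma)**2 / np.maximum(a_hat**2, EPS), 0.0)

def gamma_stein_smoothed(a_hat, t, sigma, M1=5):
    """Smoothed Stein's thresholding (105): as (104), with a_hat^2
    replaced by its local mean over an M1-by-M1 neighborhood."""
    b2 = uniform_filter(a_hat**2, size=M1, mode='reflect')
    return np.maximum(1.0 - (t * sigma)**2 / np.maximum(b2, EPS), 0.0)
```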

7. Optimality of the adaptive scale estimation

The best accuracy which can be achieved using the nonparametric regression estimates (31) and (32), equipped with the adaptive varying scale, for $y$ from the class (4) is restricted by the convergence rate

$$r^{(k)}\big(x,h_+^{(k)}(x)\big)=O\Big(\Big(\frac{\log n}{n}\Big)^{(M-|k|)/(2M+2)}\Big).\qquad(106)$$

It differs by the factor $\log n$ from the formula for the ideal estimator (56) and shows that the adaptive convergence rate is slower than that of the ideal estimator. One of the fundamental results of modern adaptive estimation theory says that this $\log n$ factor is unavoidable in adaptive estimation. Thus, there are no algorithms which could achieve a better accuracy than shown in (106). Moreover, if this convergence rate is proved for some algorithm, it means that this algorithm is the best possible in terms of the convergence rate. The theoretical analysis produced for 1D regression in [9] and for multidimensional regression in [33] shows that the ICI adaptive algorithms achieve the best convergence rate, and in this way the ICI adaptation is asymptotically optimal. Similar results for different classes of functions and different accuracy criteria are proved for many versions of Lepski's adaptation algorithms [27,28,35].

The introduced MR spectral decomposition transforms the original nonparametric estimation problem into the sequence estimation framework, with the sequence $\alpha_j(x)=(\omega_j\ast y)(x)$ defined as the projections of $y(x)$ on the subspaces $V_j$ forming the MR analysis. In other terms, the filtering in the domain of the original argument $x$ is replaced by filtering in the MR spectrum domain. The sequence estimation framework is quite different from the conventional nonparametric regression methods, which mainly exploit the smoothness of the estimated function. The sequence estimation is based on the concept that sparsity of representation is a more basic notion than smoothness, and that nonlinear thresholding can be a powerful competitor to traditional linear methods, even those equipped with adaptive scale selection algorithms (see [1,3,5,6] and references therein).

A simple example illustrates a source of the possible advantage of processing in the spectral domain. Let us assume that a signal $y(x)$ has in the spectrum domain only one, the $k$th, component different from zero:

$$y(x)=\alpha_k(x)\,q_k,\quad \alpha_j(x)=0,\ j\neq k.$$

Assume that the hard-thresholding algorithm identifies this nonzero component perfectly, i.e., $\hat\alpha_j(x)=0$ for all $j\neq k$, with the function estimate $\hat y(x)=\hat\alpha_k(x)\,q_k$. This estimate is unbiased, with the variance

$$E\big\{\big(y(x)-\hat y(x)\big)^2\big\}=\sigma^2 q_k^2,\quad q_k=\sum_{l=k}^{J}Q_{l,k}.\qquad(107)$$

Further assume that, say, the ICI algorithm also makes a perfect estimate of the adaptive scale as $i_+=k$ and gives the estimate $\hat y_+(x)=\hat y_k(x)$. According to (94) this estimate can be presented as

$$\hat y_k(x)=\sum_{j=0}^{k}\alpha_j(x)\,q_j(k),\quad q_j(k)=\sum_{l=j}^{k}Q_{l,j}.$$

This estimate is also unbiased, with the variance

$$E\big\{\big(y(x)-\hat y_k(x)\big)^2\big\}=\sigma^2\sum_{j=0}^{k}q_j^2(k).\qquad(108)$$

Comparing (107) versus (108), assume for simplicity that $Q$ is the identity matrix, i.e., the $\tilde g_j$ in (81) are orthonormal. Then $q_k=1$ and $q_j(k)=1$, and we obtain for the estimate variances $E\{(y(x)-\hat y(x))^2\}=\sigma^2$ versus $E\{(y(x)-\hat y_k(x))^2\}=\sigma^2(k+1)$. Thus, the MR hard-thresholding algorithm has smaller variance values for all scales $k>0$, with the maximum advantage for the highest-frequency scale $k=J$, when the variance of the ICI estimate takes its maximum value $\sigma^2(J+1)$ versus the MR hard-thresholding variance $E\{(y(x)-\hat y(x))^2\}=\sigma^2$.

This example shows that signals with a sparse spectrum representation define a class where the MR adaptive estimation is able to demonstrate a better performance as compared with the nonparametric methods based on the best scale selection.

A general analysis can be produced in order to reveal the ability of the introduced technique. Mainly, this sort of results are of an asymptotic nature, assuming that the number of observations $n$, the threshold $t$, and the number $J$ of scales in $H$ are growing. It can be proved that if $n$, $t$, and the scales $h_j$ are adjusted properly, the best possible convergence rate can be achieved. While this sort of analysis is beyond the scope of this paper, we wish to note that many accuracy results obtained for the wavelet techniques are actually applicable to the considered estimates, at least provided that the usual dyadic scale is assumed for $h_j$.

8. Algorithm implementation

8.1. Basic MR algorithm

Main steps of the MR algorithm:

(1) Set $H=\{h_0>h_1>h_2>\dots>h_J\}$, $m$, $t$;
(2) For $h=h_j$, $j=0,\dots,J$, calculate:
(a) the kernels $g_{h_j}(x)$ (28),
(b) the MR kernels $\omega_j(x)$ (81),
(c) the estimates $\hat a_j(x)$ (97);
(3) Apply one of the thresholding rules (102)-(105) to the estimates $\hat\alpha_j(x)$;
(4) Calculate the MR adaptive estimate according to the final formula (101).

Note. Step 2b defines a bank of linear filters of different scales $h_j$. Step 2c serves for the calculation of the estimates for all $j$ and $x$. The estimate $\hat\sigma$ for Step 3 can be obtained from the high-scale MR spectrum $\hat a_J(x)$ as the robust median estimate $\hat\sigma=\mathrm{median}_x\big(|\hat a_J(x)|\big)/0.6745$.
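Putting the pieces together, here is a sketch of these steps for a single window, reusing the routines above; the kernel bank `g_tilde_list` is an assumed input, and 0.6745 is the usual Gaussian consistency factor of the median rule:

```python
import numpy as np

def mr_denoise(z, g_tilde_list, t=1.2, gamma_rule=gamma_soft):
    """Basic MR algorithm of Section 8.1 for one window: build the MR
    kernel bank (steps 1-2b), compute the noisy spectrum (step 2c),
    estimate sigma by the robust median rule, threshold (step 3) and
    synthesize (step 4, Eq. (101))."""
    omega, Q = orthonormalize_kernels(g_tilde_list)     # steps 2a-2b
    a_hat = mr_analysis(z, omega)                       # step 2c, Eq. (97)
    q = Q.sum(axis=0)                                   # q_j = sum_{l>=j} Q[l, j]
    sigma_hat = np.median(np.abs(a_hat[-1])) / 0.6745   # robust noise estimate
    y_hat = np.zeros_like(z, dtype=float)
    for aj, qj in zip(a_hat, q):                        # steps 3-4
        y_hat += gamma_rule(aj, t, sigma_hat) * aj * qj
    return y_hat
```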

8.2. Multiple window estimation

A symmetric window $w$ is a good choice in (28) and (29) if $y$ is isotropic in a neighborhood of the estimation point. However, if $y$ is anisotropic, as happens near discontinuities or image edges, a nonsymmetric approximation of $y$ becomes much more reasonable. To deal with the anisotropy of $y$, multiple nonsymmetric window estimates are exploited [18,19]. It is assumed that the neighborhood of the pixel $x$ is separated into a number of overlapping or nonoverlapping subareas. Let $K$ such subareas be introduced. Then the adaptive scale estimates $\hat y_{[k]}$, $k=1,\dots,K$, are calculated for each of these subareas and fused together in order to yield the final estimate.

The four-quadrant nonoverlapping segmentation of the pixel neighborhood (see Fig. 4) is a simple and efficient way of fitting $y$ [18,19]. It assumes that the origin $(0,0)$ of the Euclidean rectangular coordinates is the centre of the LPA estimate for each square quadrant subarea. For each of these quadrants the kernel estimates $\hat y_{[k]}$ with the adaptive scale are calculated. Thus, for each pixel of the image we are able to obtain four independent estimates based on the different observations covered by the corresponding quadrant supports.

Fig. 4. The quadrant segmentation of the neighborhood of the LPA centre $(0,0)$.

There are a number of ways to fuse the quadrant estimates into the single final one. In particular, the inverse-variance weighted mean [9,18] or the sample mean can be applied. The latter estimate gives

$$\hat y(x)=\frac{1}{K}\sum_{k=1}^{K}\hat y_{[k]}(x).\qquad(109)$$

In our simulations we complete the set of the introduced quadrant windows by the symmetric window centered with respect to the origin $(0,0)$. The multiple window estimation significantly improves the performance of the algorithms. As a further development of this idea, special directional LPA kernels using narrow beam-wise supports are proposed in [19]; a sketch of the quadrant windows and the fusing rule is given after the next subsection.

8.3. Algorithm complexity

The calculation of the image estimate $\hat a_j(x)$ for a given $h_j$ is a linear convolution requiring $N_{\mathrm{conv}}\sim n\log n$ operations, $n=n_1 n_2$. If the sectioning procedure is used for the convolution (e.g., [34]), then $N_{\mathrm{conv}}\sim n\log n_j$, where $n_j$ is the maximum size of the square mask of the kernel $\omega_j$. These calculations are repeated for each of the $K$ subareas (quadrants) of the pixel neighborhood, with the fusing of the estimates according to (109). The thresholding is produced $J$ times for each of the $K$ subareas. Thus, overall, the algorithm complexity is proportional to $J\cdot K\cdot N_{\mathrm{conv}}$, where $K=5$ for the four-quadrant plus symmetric window estimate.
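As referenced above, a minimal sketch of one way to carve the quadrant supports and to fuse the per-window estimates by the sample mean (109); the mask construction is illustrative, with Fig. 4 defining the actual layout:

```python
import numpy as np

def quadrant_window(h, which):
    """An illustrative quadrant mask on an h-by-h support: 'which' in
    {0, 1, 2, 3} selects one quadrant relative to the centre (0, 0)."""
    r = h // 2
    x1, x2 = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    s1, s2 = [(1, 1), (1, -1), (-1, 1), (-1, -1)][which]
    return ((s1 * x1 >= 0) & (s2 * x2 >= 0)).astype(float)

def fuse_estimates(estimates):
    """Sample-mean fusing, Eq. (109): 'estimates' holds the K per-window
    adaptive estimates (four quadrants plus the symmetric window, K = 5)."""
    return np.mean(np.stack(estimates), axis=0)
```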

9. Parallels with wavelets

Let us provide a few facts from wavelet theory which help to demonstrate the similarity and the difference with the introduced MR nonparametric local polynomial regression approach. The standard MR continuous wavelet expansion for 1D continuous data $y(x)$, $x\in\mathbb{R}$, has the form of the following parametric series (e.g., [11,31]):

$$y_W(x)=\sum_{k\in Z}\alpha_k\varphi_{0k}(x)+\sum_{k\in Z}\sum_{j}\alpha_{jk}\psi_{jk}(x),\qquad(110)$$

$$\varphi_{0k}(x)=\varphi(x-k),\quad \psi_{jk}(x)=2^{j/2}\psi(2^j x-k),\qquad(111)$$

where $\varphi(x)$ and $\psi(x)$ are the scale function (father wavelet) and the wavelet (mother wavelet), respectively, $Z$ is the set of integers, and $2^{-j}$ stands for the $j$th dyadic scale. The scale $h$ used in this paper is linked with the wavelet dyadic scale by the equation $h=2^{-j}$. For the orthonormal $\varphi_{0k}(x)$ and $\psi_{jk}(x)$, $x\in\mathbb{R}$, the coefficients of the series (110) are calculated as

$$\alpha_k=\langle y,\varphi_{0k}\rangle,\quad \alpha_{jk}=\langle y,\psi_{jk}\rangle.\qquad(112)$$

The inner products used in this section assume integrals on $x\in\mathbb{R}$, for instance $\alpha_{jk}=\langle y,\psi_{jk}\rangle=\int_{-\infty}^{+\infty}y(x)\psi_{jk}(x)\,dx$. The orthonormality exploited in (112) means that

$$\langle\psi_{il},\psi_{jk}\rangle=\delta_{ij}\delta_{kl},\quad \langle\varphi_{0l},\psi_{jk}\rangle=0,\quad \langle\varphi_{0l},\varphi_{0k}\rangle=\delta_{kl},\qquad(113)$$

i.e., the intra- and inter-scale orthogonality of the functions (111) is assumed. The intra-scale orthogonality means that the functions of the same scale $j$ but different by the argument shift (the variable $k$ in (111)) are orthogonal, i.e., $\langle\psi_{jl},\psi_{jk}\rangle=\delta_{kl}$, $\langle\varphi_{0l},\varphi_{0k}\rangle=\delta_{kl}$. The inter-scale orthogonality means the orthogonality of all functions of different scales, e.g., $\langle\psi_{ik},\psi_{jk}\rangle=\delta_{ij}$, and the orthogonality between the father and mother wavelets, $\langle\varphi_{0l},\psi_{jk}\rangle=0$, is assumed as well. This double, intra- and inter-scale, orthogonality requirement makes the design of the wavelet functions quite a complex art and confines the classes of wavelet functions.

The father wavelet $\varphi$ generates the following linear subspaces in $L_2(\mathbb{R})$:

$$V_0=\Big\{y(x)=\sum_{s}\varphi(x-s)c_s:\ \sum_s|c_s|^2<\infty\Big\},\quad V_1=\big\{f(x)=y(2x):\ y\in V_0\big\},\ \dots,\ V_j=\big\{f(x)=y(2^j x):\ y\in V_0\big\},\qquad(114)$$

such that:

(1) The subspaces $V_j$ are nested, $V_j\subset V_{j+1}$, and can be represented in the form

$$V_{j+1}=V_j\oplus W_{j+1},\quad j=0,1,\dots,\qquad(115)$$

where $\oplus$ stands for the direct sum of the subspaces, $W_{j+1}$ is an orthogonal complement of the subspace $V_j$, and the complement subspaces $W_j$ for every scale $j$ are defined by the wavelets $\psi_{jk}(x)$.

(2) The subspace $\bigcup_{j=0}^{\infty}V_j=V_0\oplus\big(\bigoplus_{j=1}^{\infty}W_j\big)$ is dense in $L_2(\mathbb{R})$.

These orthogonal subspaces define the wavelet MR analysis [11,31]. It means that any $y\in L_2(\mathbb{R})$ can be represented as the series (110)-(112).

Now let us compare the wavelet expansion (110)-(112) with the corresponding nonparametric regression MR expansions (89) and (90).

The wavelet expansion $y_W$ (110) is a standard orthogonal series with invariant coefficients $\alpha_k$ and $\alpha_{jk}$ (112). As a function of $x$, this series is defined by the wavelets $\varphi_{0k}(x)$, $\psi_{jk}(x)$ (111). It means that (110) is a parametric expansion of $y$. In contrast to it, the expansion (89) is nonparametric, as its dependence on $x$ cannot be parameterized and goes through the coefficients $\alpha_0(x)$ and $\alpha_j(x)$ of the expansion. There are no invariant coefficients in this expansion and no basis functions depending on $x$, which are typical for the standard series.

The kernels $\omega_0$ and $\omega_j$, $j=1,\dots,J$, in (89) and (90) can be interpreted as the father and mother wavelets, respectively, according to their role in the analysis and the vanishing moment conditions (83). The kernel $\omega_0$ defines a lower-frequency background of the signal (father wavelet analysis), while the kernels $\omega_j$, $j=1,\dots,J$, define higher-frequency complements to this background (mother wavelet analysis).

The subspaces $W_j$ and $V_j$ are different for the wavelet and for the introduced local regression MR analysis. The principal difference is that for the wavelets the $W_j$ are orthogonal complements of the $V_j$, while there is no such orthogonality for the kernel MR analysis.

The dyadic scale in (110)-(112) is a special point defining the wavelet design and the fast algorithms. In the local regression MR analysis the scale, as defined by the set $H$, is quite arbitrary. The only serious restrictions concern the linear independence of the $g_h$ for $h\in H$.

The wavelet design for multivariable functions is a serious problem, mainly solved by using the direct product of 1D wavelets. There are no principal difficulties with the design of the LPA kernels for any dimension.

We present here the classical results of the continuous integral wavelet transform, with $x\in\mathbb{R}$ and $y\in L_2(\mathbb{R})$, versus the discrete MR local regression analysis for the kernels and estimates defined for the discrete $x\in\mathbb{Z}^2$ and $y\in l_2(\mathbb{Z}^2)$. We pragmatically consider only discrete signals, as this results in clear numerical algorithms. A generalization of the introduced kernel MR is straightforward for continuous signals belonging to $L_2(\mathbb{R})$ or $L_2(\mathbb{R}^2)$.

The discussed parallels concern only the structures and the basic ideas of these two different transforms. We may conclude that the ideas of the MR used for the wavelets (114) and (115) and for the MR local regression (85)-(87) are quite similar. However, in the considered nonparametric local polynomial version of the MR analysis, many strict constraints typical for the wavelet technique may be dropped. The nonparametric local polynomial approach is more flexible and has more freedom for the design of the filters (kernels), with concentration on the

signal properties rather than on the mathematical difficulties essential for the wavelet design.

10. Simulation

As a test signal we use the cameraman image (8-bit gray-scale) corrupted by additive zero-mean Gaussian noise. The LPA is used with the uniform square window $w$, linear polynomials $m=1$, and the finite set of scales $H=\{21,11,5,3,1\}$. The multiple window estimate of $y$ is applied as described in Section 8. We calculate five intermediate estimates obtained for the four quadrant windows and the one symmetric window $w$, respectively. Each of these five windowed estimates is calculated as spatially adaptive using the developed MR algorithms. The final estimate of $y$ is calculated as the mean (109) of these five adaptive scale estimates.

Figure 5 illustrates how the MR expansion (89) works. It shows the items of the MR expansion $\alpha_j(x)\,q_j$ for the scales $j=0,1,\dots,4$. The figures are given for the noiseless cameraman image. The first term $\alpha_0(x)\,q_0$ presents a basic smooth lower-frequency component of the image. The further terms with $j>0$ serve as complements of the basic one, providing finer higher-frequency details. The images of the expansion items become sharper for larger values of the scale index $j$. The sum of all five MR expansion items, shown in the last image of Fig. 5, gives a perfect reconstruction of the true image.

Histograms of the images from Fig. 5 are shown in Fig. 6. They illustrate the concept of the redundancy of the proposed MR nonparametric regression expansion used in the thresholding filtering. Indeed, the histogram for $j=0$ covers the whole segment $[0,1]$ of possible values of $y$. The histograms for the complement components of the MR expansion are narrower and more peaked. The last scale, $j=4$, has the smallest number of nonzero items, which are mainly concentrated in a narrow neighborhood of zero. Actually it means that the MR spectrum for $j\geqslant 1$ is sparse: only a few items of the MR nonparametric expansion are large, and the rest are small and thus can be dropped.

In the quantitative comparison of the algorithms the following criteria have been used:

(1) Root mean squared error: $\mathrm{RMSE}=\sqrt{\frac{1}{\#}\sum_x\big(y(x)-\hat y(x)\big)^2}$, where $\#$ is the number of pixels;
(2) SNR in dB: $\mathrm{SNR}=10\log_{10}\big(\sum_x|y(x)|^2/\sum_x|y(x)-\hat y(x)|^2\big)$;
(3) Improvement in SNR (ISNR) in dB: $\mathrm{ISNR}=20\log_{10}(\hat\sigma/\mathrm{RMSE})$;
(4) Peak signal-to-noise ratio (PSNR) in dB: $\mathrm{PSNR}=20\log_{10}\big(\max_x|y(x)|/\mathrm{RMSE}\big)$;
(5) Mean absolute error: $\mathrm{MAE}=\frac{1}{\#}\sum_x|y(x)-\hat y(x)|$;
(6) Maximum absolute error: $\mathrm{MAXDIF}=\max_x|y(x)-\hat y(x)|$.

These criteria allow one to evaluate the performance of the algorithms quantitatively, while PSNR is treated as a criterion linked with visual image perception. However, it appears that these criteria give quite concordant conclusions, while the visual evaluation remains an independent performance criterion. In what follows we mainly use one criterion, ISNR.
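A direct transcription of the six criteria into code; the `sigma_hat` passed to ISNR is the estimated noise standard deviation, as in the text:

```python
import numpy as np

def criteria(y, y_hat, sigma_hat):
    """The six quality criteria of Section 10, transcribed directly."""
    err = y - y_hat
    rmse = np.sqrt(np.mean(err**2))
    return {
        'RMSE':   rmse,
        'SNR':    10 * np.log10(np.sum(y**2) / np.sum(err**2)),
        'ISNR':   20 * np.log10(sigma_hat / rmse),
        'PSNR':   20 * np.log10(np.max(np.abs(y)) / rmse),
        'MAE':    np.mean(np.abs(err)),
        'MAXDIF': np.max(np.abs(err)),
    }
```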

Fig. 5. The spectrum expansion of the noiseless cameraman image. The scale index equal to zero corresponds to the first term of this expansion, $\hat\alpha_0(x)q_0$, and presents the basic lower-frequency components of the image. The further items $\hat\alpha_j(x)q_j$ with larger scale indices $j$ serve as complements of this basic item and provide finer details. The images of these items become sharper for larger values of the scale index. The sum of all five MR expansion items, shown in the last image, is identical to the original cameraman image.

In image denoising we compare the MR algorithms versus the ICI algorithm, which demonstrates a performance more or less equivalent to that of the Lepski-Spokoiny algorithm. The threshold $t$ is the main design parameter of the thresholding (102)-(105). Multiple simulations and analysis produced for different images show that $t=1.2$ is a reasonably good value of this parameter for different scenarios. It is a sort of rule of thumb for the selection of $t$. For comparison we also show results for the oracle estimator.

Figure 7 shows ISNR as a function of the SNR of the observations. The smoothed Stein's algorithm ($M_1=5$ in (105)) demonstrates the best performance and outperforms the ICI

Fig. 6. The histograms of the images shown in Fig. 5. The histogram for the scale index equal to 0 is wide, covering nearly the whole segment $[0,1]$, even a bit wider than the histogram of the true image. The histograms for the complement components of the MR expansion for the scale indices $j=1,2,3,4$ become narrower, with smaller frequencies of nonzero items. The last scale has the smallest number of nonzero items, which are well concentrated in a narrow neighborhood of zero.

algorithm by approximately 1 dB. The soft-thresholding algorithm gives values which are about 1 dB worse than those for the ICI algorithm. The basic Stein's algorithm shows figures better than the ICI algorithm only for SNR > 15 dB. The oracle estimate naturally demonstrates the best values of ISNR, about 2 dB higher than the smoothed Stein's algorithm. We do not show results for the hard-thresholding, as they are worse than those for the soft-thresholding.

Examples of the reconstructed images can be seen in Fig. 8, where the noisy image and the smoothed Stein's, ICI, and soft-thresholding images are shown. The visual evaluation is in favor of the smoothed Stein's algorithm.

Let us apply the MR algorithms to scale adaptive differentiation. We estimate the first derivatives of the cameraman image intensity on $x_1$ (horizontal axis) and on $x_2$ (vertical axis). In all of the following results we use the simulation scenario and the LPA parameters as they are for image denoising. For the sake of simplicity of presentation and discussion, for differentiation we apply a single window estimate with the symmetric window function $w$ and the soft-thresholding only. The scales for differentiation are defined by the set $H=\{21,15,11,5,3\}$, with the minimum admissible scale $h_J=3$.

Fig. 7. ISNR as a function of the SNR of the noisy observations for the cameraman image. The curves are given for the following four algorithms: soft-thresholding (star), Stein's (plus), smoothed Stein's (o), ICI (pentagram). The threshold $t=1.2$. The oracle estimation results are marked by squares. Overall, the advantage is in favor of the smoothed Stein's algorithm, which outperforms the ICI algorithm by about 1 dB.

Figures 9 and 10 illustrate the formation of the derivative estimates using the MR spectrum analysis produced by the kernels $\omega_j^{(k)}$. The images in the figures correspond to the items of the sum (95), $\hat\alpha_j^{(k)}(x)\,q_j^{(k)}$, for the scales $j=0,1,\dots,4$ and the derivatives $k=(1,0)$ and $k=(0,1)$. The last, sixth, image is the derivative estimate $\hat y^{(k)}(x)=\sum_{j=0}^{J}\hat\alpha_j^{(k)}(x)\,q_j^{(k)}$, as it is defined in (95). These MR spectra are given for the noiseless cameraman image. The influence of the scale of the differentiation operator is clearly seen. It varies from the smoothed derivative estimate given by the largest scale to the finer contour lines of the smaller-scale estimates.

In order to produce a quantitative analysis of the differentiation, we need to know accurate values of the derivatives. For such an image as the cameraman these derivatives are unknown. However, they can be evaluated numerically using the MR analysis. Let us assume that these accurate numerical derivatives are defined as the estimates given by the differentiation kernels $g^{(1,0)}(x)$ and $g^{(0,1)}(x)$ for the noiseless cameraman image, provided that the scale is equal to its minimum value. This minimum admissible scale value is $h=3$ for the considered LPA with $m=2$ and the square 2D support of the differentiating kernels.

Table 2 provides data illustrating the improvement which can be achieved by using the adaptive varying scale differentiators versus the differentiators with a fixed invariant scale.

Fig. 8. The images obtained by the soft-thresholding, smoothed Stein's, and ICI algorithms. SNR = 10 dB. The threshold parameter is $t=1.2$ for the MR algorithms and $\Gamma=1.5$ for the ICI algorithm. The smoothed Stein's algorithm demonstrates the best performance in terms of the ISNR values as well as visually.

Table 2
Accuracy of the derivative estimation (the numerical entries were not recoverable from the source)

Invariant scale    RMSE    MAE    MAXDIF
h = 21
h = 15
h = 11
h = 5
h = 3
Adaptive scale

The criteria values are given as the mean of the corresponding values obtained for the derivatives on $x_1$ and $x_2$. Lines 1-5 of the table show the criteria values for the invariant scale estimators, and the last line corresponds to the MR soft-thresholding adaptive scale estimator ($t=1.2$). It can be concluded that the best scale-invariant estimator has the scale equal to its minimum

Fig. 9. The spectrum expansion of the derivative $\partial/\partial x_1$ for the noiseless cameraman image. The scale index equal to zero corresponds to the first term of this expansion, $\hat\alpha_0^{(1,0)}(x)q_0^{(1,0)}$, and presents the basic lower-frequency (smooth) components of the derivative. The further items with larger scale indices serve as complements of this basic item and provide finer sharp details. The sum of all five MR expansion items, shown in the last image, is the MR nonparametric regression estimate of the derivative.

value, $h=3$. It can also be noticed that these accuracy figures are quite sensitive with respect to the scale value. Further, comparing the criteria values of this best scale-invariant estimator versus the corresponding values for the adaptive estimator, we can see quite a significant improvement in the values of RMSE and MAE. It amounts to about 30% for RMSE and about 40% for MAE. A smaller improvement, of about 10%, can be seen in the values of MAXDIF. Visual effects of the adaptive scale differentiation are illustrated in Fig. 11.

Fig. 10. The spectrum expansion of the derivative $\partial/\partial x_2$ for the noiseless cameraman image. The scale index equal to zero corresponds to the first term of this expansion, $\hat\alpha_0^{(0,1)}(x)q_0^{(0,1)}$, and presents the basic lower-frequency (smooth) components of the derivative. The further items with larger scale indices serve as complements of this basic item and provide fine sharp details. The sum of all five MR expansion items, shown in the last image, is the MR nonparametric regression estimate of the derivative.

The images presented in Fig. 11 show the sum of the absolute values of the estimates of the derivatives $\partial y/\partial x_1$ and $\partial y/\partial x_2$. The left-hand-side image is obtained by using the derivative estimators with the best invariant scale $h=3$. The right-hand-side image corresponds to the MR varying adaptive scale soft-thresholding differentiator. The noisy components of the derivative estimates, clearly seen in the left-hand-side image, are well cleared out in the right-hand-side image, while the fine details of the edges of the cameraman image are

Fig. 11. The images show the sum of the absolute values of the estimates of the derivatives $\partial y/\partial x_1$ and $\partial y/\partial x_2$. The left-hand-side estimates are obtained using the derivative estimator kernels $g^{(1,0)}$ and $g^{(0,1)}$ with the best found invariant scale $h=3$. The right-hand-side image is obtained by the MR varying adaptive scale soft-thresholding estimators with $t=1.2$. The noisy components of the derivatives, clearly seen in the left-hand-side image, are cleared out in the right-hand-side image, while the fine edge details of the cameraman are well preserved.

preserved. Thus, both visually and quantitatively, the adaptive scale MR differentiator demonstrates better results as compared with the scale-invariant estimator with the best possible oracle scale selection.

11. Conclusions

A new varying adaptive scale nonparametric local polynomial regression technique is proposed. It is based on the LPA applied to the design of the filters, joined with the developed spectral MR analysis. The usual scale-adaptive local polynomial regression estimates are based on the selection of the best scale. The proposed MR analysis assumes a multiscale transform of the observations, filtering of the obtained local polynomial regression spectra, and fusing these filtered spectra into the final estimate. This final estimate is composed from the estimates of the different scales, and not from a single one only, as it is in the classical adaptive nonparametric local polynomial regression. The MR estimate belongs to a more general class of estimates and is able to provide a better accuracy.

The presentation of the MR analysis is given in terms of image processing. However, the approach is applicable to data of any dimensionality defined on regular or irregular grids. For regular grids the MR analysis methods allow fast implementations based on the fast convolution algorithms. For irregular grids this sort of fast algorithms is not applicable.

The developed MR nonparametric technique is quite universal and can be applied to many different tasks. The introduced spectral expansion allows one to involve many traditional techniques of image processing. In particular, one may weight the spectral components or produce their nonlinear transforms in order to obtain desirable image enhancement effects. The MR nonparametric local polynomial technique can be applied to edge detection, image improvement, recognition problems, etc.


More information

MAT 145. Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points

MAT 145. Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points MAT 15 Test #2 Name Solution Guide Type of Calculator Used TI-89 Titanium 100 points Score 100 possible points Use te grap of a function sown ere as you respond to questions 1 to 8. 1. lim f (x) 0 2. lim

More information

Fast optimal bandwidth selection for kernel density estimation

Fast optimal bandwidth selection for kernel density estimation Fast optimal bandwidt selection for kernel density estimation Vikas Candrakant Raykar and Ramani Duraiswami Dept of computer science and UMIACS, University of Maryland, CollegePark {vikas,ramani}@csumdedu

More information

A New Method for Varying Adaptive Bandwidth Selection

A New Method for Varying Adaptive Bandwidth Selection IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. 47, O. 9, SEPTEMBER 1999 2567 TABLE I SQUARE ROOT MEA SQUARED ERRORS (SRMSE) OF ESTIMATIO USIG THE LPA AD VARIOUS WAVELET METHODS A ew Method for Varying Adaptive

More information

Estimating Peak Bone Mineral Density in Osteoporosis Diagnosis by Maximum Distribution

Estimating Peak Bone Mineral Density in Osteoporosis Diagnosis by Maximum Distribution International Journal of Clinical Medicine Researc 2016; 3(5): 76-80 ttp://www.aascit.org/journal/ijcmr ISSN: 2375-3838 Estimating Peak Bone Mineral Density in Osteoporosis Diagnosis by Maximum Distribution

More information

Poisson Equation in Sobolev Spaces

Poisson Equation in Sobolev Spaces Poisson Equation in Sobolev Spaces OcMountain Dayligt Time. 6, 011 Today we discuss te Poisson equation in Sobolev spaces. It s existence, uniqueness, and regularity. Weak Solution. u = f in, u = g on

More information

0.1 Differentiation Rules

0.1 Differentiation Rules 0.1 Differentiation Rules From our previous work we ve seen tat it can be quite a task to calculate te erivative of an arbitrary function. Just working wit a secon-orer polynomial tings get pretty complicate

More information

2.11 That s So Derivative

2.11 That s So Derivative 2.11 Tat s So Derivative Introduction to Differential Calculus Just as one defines instantaneous velocity in terms of average velocity, we now define te instantaneous rate of cange of a function at a point

More information

Fast Exact Univariate Kernel Density Estimation

Fast Exact Univariate Kernel Density Estimation Fast Exact Univariate Kernel Density Estimation David P. Hofmeyr Department of Statistics and Actuarial Science, Stellenbosc University arxiv:1806.00690v2 [stat.co] 12 Jul 2018 July 13, 2018 Abstract Tis

More information

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. Preface Here are my online notes for my course tat I teac ere at Lamar University. Despite te fact tat tese are my class notes, tey sould be accessible to anyone wanting to learn or needing a refreser

More information

Weierstraß-Institut. im Forschungsverbund Berlin e.v. Preprint ISSN

Weierstraß-Institut. im Forschungsverbund Berlin e.v. Preprint ISSN Weierstraß-Institut für Angewandte Analysis und Stocastik im Forscungsverbund Berlin e.v. Preprint ISSN 0946 8633 Stability of infinite dimensional control problems wit pointwise state constraints Micael

More information

Controllability of a one-dimensional fractional heat equation: theoretical and numerical aspects

Controllability of a one-dimensional fractional heat equation: theoretical and numerical aspects Controllability of a one-dimensional fractional eat equation: teoretical and numerical aspects Umberto Biccari, Víctor Hernández-Santamaría To cite tis version: Umberto Biccari, Víctor Hernández-Santamaría.

More information

3.4 Worksheet: Proof of the Chain Rule NAME

3.4 Worksheet: Proof of the Chain Rule NAME Mat 1170 3.4 Workseet: Proof of te Cain Rule NAME Te Cain Rule So far we are able to differentiate all types of functions. For example: polynomials, rational, root, and trigonometric functions. We are

More information

NON STANDARD FITTED FINITE DIFFERENCE METHOD FOR SINGULAR PERTURBATION PROBLEMS USING CUBIC SPLINE

NON STANDARD FITTED FINITE DIFFERENCE METHOD FOR SINGULAR PERTURBATION PROBLEMS USING CUBIC SPLINE Global and Stocastic Analysis Vol. 4 No. 1, January 2017, 1-10 NON STANDARD FITTED FINITE DIFFERENCE METHOD FOR SINGULAR PERTURBATION PROBLEMS USING CUBIC SPLINE K. PHANEENDRA AND E. SIVA PRASAD Abstract.

More information

MA455 Manifolds Solutions 1 May 2008

MA455 Manifolds Solutions 1 May 2008 MA455 Manifolds Solutions 1 May 2008 1. (i) Given real numbers a < b, find a diffeomorpism (a, b) R. Solution: For example first map (a, b) to (0, π/2) and ten map (0, π/2) diffeomorpically to R using

More information

Convergence and Descent Properties for a Class of Multilevel Optimization Algorithms

Convergence and Descent Properties for a Class of Multilevel Optimization Algorithms Convergence and Descent Properties for a Class of Multilevel Optimization Algoritms Stepen G. Nas April 28, 2010 Abstract I present a multilevel optimization approac (termed MG/Opt) for te solution of

More information

arxiv: v1 [math.na] 11 Apr 2016

arxiv: v1 [math.na] 11 Apr 2016 Hig order approximation to non-smoot multivariate functions Anat Amir David Levin arxiv:164.281v1 [mat.na] 11 Apr 216 April 12, 216 Abstract Approximations of non-smoot multivariate functions return low-order

More information

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION PETER G. HALL AND JEFFREY S. RACINE Abstract. Many practical problems require nonparametric estimates of regression functions, and local polynomial

More information

Differentiation in higher dimensions

Differentiation in higher dimensions Capter 2 Differentiation in iger dimensions 2.1 Te Total Derivative Recall tat if f : R R is a 1-variable function, and a R, we say tat f is differentiable at x = a if and only if te ratio f(a+) f(a) tends

More information

Applications of the van Trees inequality to non-parametric estimation.

Applications of the van Trees inequality to non-parametric estimation. Brno-06, Lecture 2, 16.05.06 D/Stat/Brno-06/2.tex www.mast.queensu.ca/ blevit/ Applications of te van Trees inequality to non-parametric estimation. Regular non-parametric problems. As an example of suc

More information