Noise Reduction for Enhanced Component Identification in Multi-Dimensional Biomolecular NMR Studies
Nicoleta Serban¹

¹ The author is grateful to Gordon Rule for allowing her to use the data from the NMR experiment and for his mentorship in NMR research. She is also thankful to Kobi Abayomi, Ray Carroll and Brani Vidakovic for their useful insights about the research and the presentation of this paper.

The objective of the research presented in this paper is to shed light on the benefits of a multi-dimensional wavelet-based methodology applied to NMR biomolecular data analysis. Specifically, the emphasis is on noise reduction for enhanced component identification in multi-dimensional mixture regression. The contributions of this research are twofold. First, the wavelet-based noise reduction method applies to multi-dimensional data, whereas most of the existing work focuses on one- or two-dimensional data only. The proposed wavelet-based methodology is founded on a rigorous analysis of the dependence between wavelet coefficients, an important aspect of multi-dimensional wavelet de-noising. The wavelet de-noising rule is based on Stein's unbiased risk estimator (SURE), where the smoothness thresholds vary with the resolution level and orientation of the wavelet transform and are selected by controlling the False Discovery Rate of the significant wavelet coefficients. Second, this paper highlights the application of the wavelet methodology to multi-dimensional NMR data analysis for protein structure determination. The noise reduction method is general and applicable to multi-dimensional data arising in many other research fields, prominently in the biological sciences. Our empirical investigation shows that reducing the noise using the method in this paper results in more detectable true components and fewer false positives without altering the shape of the significant components.

1 Introduction

The objective of the research presented in this paper is to introduce a wavelet-based noise reduction method for enhanced component identification in multi-dimensional
mixture regression described by the model

Z_{i_1,\ldots,i_d} = \sum_{l=1}^{L} s(x_{i_1}, \ldots, x_{i_d}; \theta_l) + \sigma \epsilon_{i_1,\ldots,i_d}, \quad i_1 = 1, \ldots, M_1, \; \ldots, \; i_d = 1, \ldots, M_d \qquad (1)

where s(x_{i_1}, \ldots, x_{i_d}; \theta_l), with \theta_l = (A_l, w_l, \tau_l), is a regression component identifiable by its set of parameters \theta_l and observed over a set of equally spaced d-dimensional grid points (x_{i_1}, \ldots, x_{i_d}). We assume that the regression function s is bounded above zero, continuous and unimodal. Examples of such functions are the Gaussian and the Lorentzian, commonly used in modeling NMR data. The number of components L is large and unknown. One challenging statistical problem relevant to the multi-dimensional mixture regression model is component identification, i.e. estimation of L. Because the regression components are contaminated by noise and many components may be observed at the noise level, a preliminary step is to reduce the noise to enhance the estimation accuracy of the number of components L. Typically, data generated by the multi-dimensional mixture regression model (1) feature sharp changes, spatial inhomogeneity, signal sparsity and locally correlated noise. A common method for noise reduction, which overcomes these difficulties, is to filter the data using a multiscale bandpass filter or wavelet transform and to reduce the noise in the wavelet domain using a spatially adaptive method. Two common wavelet-based noise reduction methods are hard and soft thresholding, introduced by Donoho (1995) and Donoho and Johnstone (1995). In the past 15 years, a series of other noise reduction techniques have been explored. Jansen (2001) reviews the existing noise reduction methods, complemented by methodological and theoretical results. Antoniadis et al. (2001) present a comparison study of a series of de-noising methods applied to one-dimensional signals. Recent research in wavelet-based noise reduction explores wavelet coefficient shrinkage methods that take into account the intra-scale and inter-scale dependence of the wavelet coefficients.
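For concreteness, model (1) can be simulated directly. The sketch below (Python with NumPy) generates a d-dimensional mixture of unimodal components plus noise; the Gaussian component shape, the unit grid, and the particular parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_mixture(grid_shape, components, sigma, seed=0):
    """Simulate model (1): a sum of unimodal (here Gaussian) components
    observed on an equally spaced d-dimensional grid, plus Gaussian noise.
    `components` is a list of (amplitude A_l, center w_l, width tau_l)."""
    rng = np.random.default_rng(seed)
    axes = [np.linspace(0.0, 1.0, n) for n in grid_shape]
    mesh = np.meshgrid(*axes, indexing="ij")
    signal = np.zeros(grid_shape)
    for A, center, tau in components:
        # squared, width-scaled distance to the component center
        dist2 = sum(((x - c) / t) ** 2 for x, c, t in zip(mesh, center, tau))
        signal += A * np.exp(-0.5 * dist2)   # Gaussian component shape
    return signal + sigma * rng.standard_normal(grid_shape), signal

# One high-amplitude and one low-amplitude component on a 16x16x16 grid.
comps = [(50.0, (0.3, 0.3, 0.3), (0.1, 0.1, 0.1)),
         (12.0, (0.7, 0.6, 0.5), (0.08, 0.1, 0.12))]
noisy, clean = simulate_mixture((16, 16, 16), comps, sigma=5.0)
```

With a noise level comparable to the second component's amplitude, that component is barely distinguishable from the background, which is exactly the situation the de-noising step targets.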
Some of the early works on wavelet coefficient shrinkage using inter-scale dependence are by Hall et al. (1997) and Cai (1999), who introduce the concept of non-overlapping block thresholding. Other relevant research is by Shapiro (1993), Jansen (2001), Cai and Silverman (2001), Abramovich et al. (2002), Pizurica et al. (2002), Shen et al. (2002), Portilla et al. (2003), Autin (2008), Chaux et al. (2008), Cai and Zhou (2009) and the references therein. Although there are many competitive approaches to wavelet-based noise reduction, they have not yet been fully explored for more than two dimensions. Multi-dimensional wavelet noise reduction requires specific considerations, as the structure of the coefficients is highly complex: the intrinsic dependence between coefficients extends to both intra- and inter-resolution levels and orientations. Moreover, the signal may be highly sparse and the signal-to-noise structure may vary with the dimensionality. The methodology presented in this paper contributes to the research on wavelet-based noise reduction with a procedure that focuses on d-dimensional data with d > 2. The noise reduction method introduced in this paper complements the work on block thresholding by Cai (1999) and the methodological and theoretical considerations introduced by Johnstone and Silverman (1997), Cai and Silverman (2001) and Shen et al. (2002). In most of the work so far on block thresholding, the selection of the block composition is arbitrary to some extent. One exception is Cai and Zhou (2009), who introduce a non-overlapping block thresholding method allowing for varying, data-driven block sizes across resolution levels. This work considers intra-scale dependencies only and applies to one-dimensional data; for d-dimensional data, the problem of selecting the block size is equivalent to a d-dimensional optimization problem, since the block size will vary across dimensions. Although many research papers highlight the importance of incorporating intra- and inter-scale dependencies in estimating and de-noising wavelet coefficients, none presents theoretically founded explanations for selecting one block of coefficients over another. The wavelet-based methodology in this paper is founded on a rigorous analysis of the dependence between wavelet coefficients, an important aspect of multi-dimensional wavelet de-noising.
The emphasis of the multi-dimensional noise reduction method introduced in this paper is on enhanced component identification for the regression model in (1): more detectable components for a given number of data samples and fewer false-positive components. With this objective in mind, the overlapping-block shrinkage rule investigated in this paper is based on Stein's unbiased risk estimator (SURE), where the smoothness thresholds vary with the resolution level and orientation of the wavelet transform to optimally adapt to spatial inhomogeneities in multi-dimensional data generated by the regression model in (1). The smoothness thresholds are data-driven and selected by controlling the False Discovery Rate (FDR) of the significant wavelet coefficients, which, in turn, implies controlling the FDR of the regression components, the primary objective of this research. In contrast, in most of the existing block-thresholding methods (Cai, 1999; Cai and Silverman, 2001; Portilla et al., 2003; Autin, 2008; and Chaux et al., 2008), the smoothness thresholds are fixed across resolutions and orientations.

The statistical application investigated in this paper is pertinent to the study of three-dimensional protein structure determination using Nuclear Magnetic Resonance (NMR). In NMR data analysis for biomolecular studies, one primary objective is to estimate parameters (e.g. chemical shifts) of the atomic nuclei of a protein under study when the protein is magnetized using a strong magnetic field. Under protein magnetization, targeted atomic nuclei undergo energy transfers; each energy transfer induces a signal which is mathematically described by a decaying sinusoid. Therefore, the NMR signal generated by a d-dimensional NMR experiment is a sum of decaying sinusoids, commonly observed over equally spaced time points, plus error:

S(t_1, t_2, \ldots, t_d) = \sum_{l=1}^{L} A_l e^{i\phi_l} \prod_{s=1}^{d} e^{-t_s/\tau_{sl}} e^{i t_s w_{sl}} + \epsilon_{t_1,\ldots,t_d} \qquad (2)

where each sinusoid is generated by an energy transfer among d atomic nuclei in a d-dimensional NMR experiment (Hoch and Stern, 1996). The model parameters of interest are the resonance frequencies w_l = (w_{1l}, \ldots, w_{dl}) (translated into chemical shifts) and the signal amplitudes A_l (translated into structural distances of the atomic nuclei involved in the transfer of energy in specific NMR experiments). Here L is the number of observed energy transfers, which is large and unknown. The protein structure is resolved by accurately estimating the resonance frequencies and the signal amplitudes from data generated by NMR experiments. The traditional methodology in biomolecular NMR data analysis involves Fourier transformation (FT) of the NMR signal data, complemented by other pre-processing steps (Hoch and Stern, 1996).
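For intuition, a one-dimensional analogue of model (2) and its Fourier transformation can be sketched as follows (Python with NumPy). The sampling interval, decay constants, amplitudes and frequencies below are illustrative assumptions; the point is only that each decaying sinusoid in the time domain becomes a peak in the frequency domain.

```python
import numpy as np

def fid_1d(t, transfers):
    """One-dimensional analogue of model (2): each energy transfer l
    contributes A_l * exp(i*phi_l) * exp(-t/tau_l) * exp(i*t*w_l)."""
    s = np.zeros_like(t, dtype=complex)
    for A, phi, tau, w in transfers:
        s += A * np.exp(1j * phi) * np.exp(-t / tau) * np.exp(1j * t * w)
    return s

t = np.arange(512) * 0.01                  # equally spaced time points (s)
# Two transfers: resonances at 20 Hz and 35 Hz, equal decay times.
transfers = [(1.0, 0.0, 0.5, 2 * np.pi * 20.0),
             (0.5, 0.0, 0.5, 2 * np.pi * 35.0)]
spectrum = np.fft.fft(fid_1d(t, transfers))
freqs = np.fft.fftfreq(t.size, d=0.01)     # frequency axis in Hz
peak = freqs[np.argmax(np.abs(spectrum))]  # location of the tallest peak
```

The tallest spectral peak lands at the frequency of the highest-amplitude sinusoid, which after FT becomes one of the mixture components of model (1).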
After Fourier transformation, the resulting model is a d-dimensional mixture regression model as described by equation (1). In this model, the parameters of the l-th regression component are the location parameters w_l = (w_{1l}, \ldots, w_{dl}), which are the signal frequencies, the width parameters \tau_l = (\tau_{1l}, \ldots, \tau_{dl}), and the amplitude parameter A_l. In this paper, the regression components in model (1) are dubbed spectral components and their parameters are dubbed spectral parameters. Because of the one-to-one mapping between energy transfers and spectral components, the problem of identifying the parameters of the atomic nuclei undergoing energy transfers translates into accurately identifying and estimating the
spectral parameters. Spectral components and their parameters must be identified accurately for accurate prediction of the protein structure. In multi-dimensional NMR data, many of the spectral components will have low amplitude, and therefore they will be difficult to distinguish from the noise in the data. Detecting spectral components with amplitude only slightly above the noise level, without erroneously including noise components (false positives), is crucial for a robust and reliable component identification algorithm, which will lead to stable protein prediction (Herrmann et al., 2002). In certain cases, the absence of a small number of essential spectral components can lead to a significant deviation of the structure (Güntert, 2003). In NMR biomolecular studies, it is common practice to manually remove the noise or false positives and/or manually identify low-amplitude components. In this paper, the approach for overcoming this difficulty is to partially reduce the noise in the NMR data using a wavelet-based methodology. The NMR literature has already acknowledged the potential of wavelet-based noise reduction but has not fully explored the benefits of (wavelet-based) noise reduction for multi-dimensional NMR data (Trbovic et al., 2005; Dancea and Günther, 2005). Generally, the de-noising methodology applies to other research applications generating multi-dimensional data. One such example is breast computed tomography (CT), where the mixture components correspond to lesion or tumor masses. In this application, it is important to identify the mixture components or lesions to characterize their size and their distribution, providing information about the survival rate as well as whether the tumors are benign or malignant (Ning et al., 2004). One difficulty in this application is identification of small tumors against a noisy background.
To overcome this difficulty, one approach is to reduce the noise using a spatially adaptive method, similar to the NMR component identification problem. Zhong et al. (2004) applied wavelet-based de-noising to two-dimensional CT data. However, their method does not allow for spatial adaptivity to the signal inhomogeneity and sparsity that characterize CT data. Because three-dimensional CT data are increasingly used, multi-dimensional noise reduction methods will be key to enhanced detection of tumor masses using CT technology. In this paper, we first introduce the noise reduction method for multi-dimensional data in Section 2, then apply the proposed methods to three- and four-dimensional synthetic examples in Section 3 and to three-dimensional NMR data from two different experiments in Section 4. We conclude with a discussion and further considerations
on the methodology presented in this paper.

2 Noise Reduction in Multi-dimensional Data

In the underlying model described in (1), Z_{i_1,\ldots,i_d} are observed intensities in a d-dimensional space and the errors \epsilon_{i_1,\ldots,i_d} are assumed to be additive, normally distributed but locally correlated. These are common assumptions in NMR data analysis (Hoch and Stern, 1996; Grage and Akke, 2003). The noise reduction method is a three-step procedure:

1. Decomposition of the multi-dimensional data using a wavelet basis;

2. De-noising the coefficients in the wavelet domain; and

3. Reconstruction based on the wavelet coefficients de-noised at step (2).

In this section, we first describe the noise reduction method, including a discussion of the intra-resolution and inter-resolution dependence between coefficients and the definition of the block of influence for a wavelet coefficient using these dependence relationships. Using a block-based statistic called the cumulative influence statistic, we further define the shrinkage, or de-noising, rule for the wavelet coefficients and propose a method for deriving the shrinkage level. In the final subsection, we introduce a wavelet-based method for component identification.

2.1 Method Description

In this study, we apply a separable or tensor-product orthogonal wavelet transform (Section 7 in Mallat, 1998 and Section 5 in Vidakovic, 1999), which maps the observed intensities to the wavelet-domain coefficients

\{Z_{i_1 \ldots i_d}\} \longrightarrow \{\bar\alpha_{i_1 \ldots i_d}\} \text{ (coarse coefficients)}, \quad \{\bar\beta^{j,m}_{i_1 \ldots i_d}\} \text{ (detail coefficients)}

where j = J_0, \ldots, J indexes the resolution level, m = 1, \ldots, 2^d - 1 indexes the orientation, and (i_1, \ldots, i_d), with i_1 = 1, \ldots, 2^j; \ldots; i_d = 1, \ldots, 2^j, are grid locations at resolution level j. A wavelet basis with a small number of vanishing moments is used to capture local regularities with a small Lipschitz exponent (see Mallat and Hwang, 1992 and the references therein).
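The three-step procedure can be sketched end to end with a single-level separable Haar transform (Python with NumPy). The Haar basis, the hard-thresholding stand-in for the paper's block shrinkage rule, and the MAD-based noise estimate are simplifying assumptions for illustration; the paper itself uses Daubechies-family wavelets and the block rule developed below.

```python
import numpy as np

def haar_step(x, axis):
    """One Haar analysis step along `axis`: pairwise averages (coarse)
    and differences (detail), orthonormally scaled."""
    a = x.take(np.arange(0, x.shape[axis], 2), axis=axis)
    b = x.take(np.arange(1, x.shape[axis], 2), axis=axis)
    return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

def haar_inv(c, d, axis):
    """Invert one Haar step along `axis` by interleaving even/odd samples."""
    a, b = (c + d) / np.sqrt(2), (c - d) / np.sqrt(2)
    shape = list(c.shape)
    shape[axis] *= 2
    out = np.empty(shape)
    even = [slice(None)] * len(shape); even[axis] = slice(0, None, 2)
    odd = [slice(None)] * len(shape); odd[axis] = slice(1, None, 2)
    out[tuple(even)], out[tuple(odd)] = a, b
    return out

def dwt_nd(x):
    """Step 1: separable single-level transform. Subbands are keyed by an
    orientation tuple with one 0 (coarse) or 1 (detail) entry per axis."""
    bands = {(): x}
    for axis in range(x.ndim):
        nxt = {}
        for key, arr in bands.items():
            c, d = haar_step(arr, axis)
            nxt[key + (0,)], nxt[key + (1,)] = c, d
        bands = nxt
    return bands

def idwt_nd(bands):
    """Step 3: reconstruction, inverting the axis sweeps in reverse order."""
    ndim = len(next(iter(bands)))
    for axis in reversed(range(ndim)):
        merged = {}
        for key in {k[:-1] for k in bands}:
            merged[key] = haar_inv(bands[key + (0,)], bands[key + (1,)], axis)
        bands = merged
    return bands[()]

def denoise(noisy):
    """Steps 1-3: decompose, shrink the detail subbands, reconstruct.
    Hard thresholding at 3*sigma_hat per subband stands in for the block rule."""
    bands = dwt_nd(noisy)
    for key, arr in bands.items():
        if any(key):                                   # detail subbands only
            sigma = np.median(np.abs(arr)) / 0.6745    # MAD noise estimate
            bands[key] = np.where(np.abs(arr) > 3 * sigma, arr, 0.0)
    return idwt_nd(bands)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8))
assert np.allclose(idwt_nd(dwt_nd(x)), x)   # perfect reconstruction check
smoothed = denoise(x)
```

For d = 3 this produces 2^3 - 1 = 7 detail orientations per level, which is exactly the indexing (j, m) used throughout this section.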
Daubechies and Symmlet (least asymmetric Daubechies) wavelets are the common choice for capturing the regularities described by spectral components in NMR data because they have a support of minimum size for a given number of vanishing moments (Daubechies, 1992). One key advantage of performing noise reduction in the wavelet domain rather than directly in the original domain is that the wavelet coefficients are approximately uncorrelated although the original data may be highly correlated. Johnstone and Silverman (1997) and Jansen (2001, Lemma 5.1) showed that for one-dimensional signals, if the noise in the data is stationary and correlated, then the variance of the wavelet coefficients will depend on the resolution level but otherwise the coefficients are approximately uncorrelated. These results can be extended to multi-dimensional data with one modification: the variance of the wavelet coefficients will depend on both the resolution level and the orientation. Similar to Johnstone and Silverman (1997), in this research the detail coefficients are normalized,

\bar\beta^{j,m}_{i_1 \ldots i_d} \longrightarrow \bar\beta^{j,m}_{i_1 \ldots i_d} / \hat\sigma_{j,m},

where \hat\sigma^2_{j,m} is the estimated variance within resolution j and orientation m; therefore, we assume that the variance of the normalized coefficients is one (V(\bar\beta^{j,m}) = 1). We estimate the resolution- and orientation-dependent variances using the median absolute deviation estimator.

Block-Level Shrinkage

A series of existing noise reduction methods use the idea of wavelet coefficient shrinkage using information from neighboring coefficients within the same resolution and across resolutions. For example, most block-thresholding methods use information between coefficients within the same resolution (Hall et al., 1997; Cai, 1999; Cai and Silverman, 2001; Cai and Zhou, 2009) and some across resolution levels (Chaux et al., 2008 and Portilla et al., 2003). Other noise reduction methods define inter-resolution dependence on the basis of a tree of wavelet coefficients (Jansen, 2001, Section 5.3.1; Autin, 2008). Most of the wavelet de-noising literature has focused on one- and two-dimensional applications only.
However, one- and two-dimensional models overlook the challenges arising in multi-dimensional data. The number of intra- and inter-resolution neighboring coefficients increases exponentially with the dimensionality, and the dependence relationships vary not only with the resolution level but also
with the orientation. In this paper, we investigate a series of significance and insignificance relationships between wavelet coefficients and define a block of influence for each coefficient based on these relationships, as described in this subsection.

Block of Influence. We define the block of influence of a wavelet coefficient based on significance and insignificance relationships with its neighboring coefficients. We say there is a significance relationship when the significance of a set of specific neighbors implies that a coefficient is also significant. We say there is an insignificance relationship when the insignificance of a coefficient implies the insignificance of specific neighboring coefficients. Using these relationships, we provide a theoretical basis for selecting the block of influence under the multi-dimensional mixture regression model. Assuming the model defined in (1), we show in this paper that there is a significance relationship between a coefficient and its immediate neighbors from the same resolution and the same orientation, stated in part (1) of Proposition 1. There is also a significance relationship between a coefficient and the coefficients corresponding to the same location and the same resolution but from different orientations. This relationship is stated in part (2) of Proposition 1. Finally, there is an insignificance relationship between a coefficient and the same-location coefficients from the same orientation but the neighboring descendant resolution, stated in part (3) of Proposition 1.

Proposition 1. Under a series of assumptions for the mixture regression model (1) described in the Appendix, the following significance and insignificance relationships hold:

1. \bar\beta^{j,m}_{i_1 \pm 1, \ldots, i_d} > 0, \; \ldots, \; \bar\beta^{j,m}_{i_1, \ldots, i_d \pm 1} > 0 \implies \bar\beta^{j,m}_{i_1, \ldots, i_d} > 0; \qquad (3)

2. For d \geq 3: \bar\beta^{j,m'}_{i_1, \ldots, i_d} > 0 \text{ for } m' = 1, \ldots, 2^d - 1, \; m' \neq m \implies \bar\beta^{j,m}_{i_1, \ldots, i_d} > 0; \qquad (4)

3. \bar\beta^{j,m}_{i_1, \ldots, i_d} = 0 \implies \bar\beta^{j+1,m}_{2i_1, \ldots, 2i_d} = 0, \; \bar\beta^{j+1,m}_{2i_1+1, \ldots, 2i_d} = 0, \; \ldots, \; \bar\beta^{j+1,m}_{2i_1+1, \ldots, 2i_d+1} = 0. \qquad (5)

The insignificance relationship in (5) holds even for higher descendant resolution levels, but in this research we will only include the coefficients from the descendant resolution j+1 in the block of influence. In Appendix A, we prove the significance and insignificance relationships described in this proposition under specific assumptions about the shape of the components s in the mixture regression model described in (1). One has to bear in mind that, in practice, a significance relationship is when the
source coefficients are significantly away from zero and an insignificance relationship is when the source coefficients are approximately zero due to the presence of noise.

Based on the significance and insignificance relationships described in Proposition 1, the block of influence is defined based on three sources: (a) the coefficient itself (\bar\beta^{j,m}_{i_1 \ldots i_d}); (b) the immediate neighboring coefficients from the same resolution level and the same orientation (\bar\beta^{j,m}_{i_1 \pm 1, \ldots, i_d}, \ldots, \bar\beta^{j,m}_{i_1, \ldots, i_d \pm 1}); and (c) the same-location coefficients from the same orientation but the neighboring descendant resolution (\bar\beta^{j+1,m}_{2i_1, \ldots, 2i_d}, \ldots, \bar\beta^{j+1,m}_{2i_1+1, \ldots, 2i_d+1}). The source of influence in (b) is a significance source (part 1 of Proposition 1) and the source of influence in (c) is an insignificance source (part 3 of Proposition 1). We use the two sources of energy in (b) and (c) to balance the significance and insignificance influence from the intra- and inter-scale neighboring coefficients. For multi-dimensional data with d \geq 3, we may replace the significance source (b) with the wavelet coefficients from the same resolution, the same location but different orientations (\bar\beta^{j,m'} with m' \neq m), since this is also a significance source (Proposition 1, part 2). Shen et al. (2002) proposed a similar block composition for two-dimensional data. The block defined in their approach includes all three sources of energy in Proposition 1. That is, it includes two significance sources and one insignificance source. With two sources of significance, the block-shrinkage rule will be biased towards significant coefficients. Moreover, for two-dimensional data, the significance relationship coming from coefficients from different orientations does not hold under the modeling assumptions stated in the Appendix.
We define the average influence measures for the three sources of a wavelet coefficient at resolution j, orientation m and grid location (i_1, \ldots, i_d) in the following equations:

(S^{j,m}_{i_1 \ldots i_d})_1 = (\bar\beta^{j,m}_{i_1, \ldots, i_d})^2

(S^{j,m}_{i_1 \ldots i_d})_2 = \frac{1}{2d} \left( (\bar\beta^{j,m}_{i_1-1, \ldots, i_d})^2 + (\bar\beta^{j,m}_{i_1+1, \ldots, i_d})^2 + \ldots + (\bar\beta^{j,m}_{i_1, \ldots, i_d+1})^2 \right)

(S^{j,m}_{i_1 \ldots i_d})_3 = \frac{1}{2^d} \left( (\bar\beta^{j+1,m}_{2i_1, \ldots, 2i_d})^2 + (\bar\beta^{j+1,m}_{2i_1+1, \ldots, 2i_d})^2 + \ldots + (\bar\beta^{j+1,m}_{2i_1+1, \ldots, 2i_d+1})^2 \right)

and the cumulative influence statistic as

S^{j,m}_{i_1 \ldots i_d} = (S^{j,m}_{i_1 \ldots i_d})_1 + (S^{j,m}_{i_1 \ldots i_d})_2 + (S^{j,m}_{i_1 \ldots i_d})_3. \qquad (6)

The cumulative energetic influence balances the average influence from both intra-
and inter-scale dependence sources, as supported by Proposition 1.

Shrinkage rule. We estimate a coefficient using a shrinkage rule based on the James-Stein estimator,

\hat\beta^{j,m}_{i_1 \ldots i_d} = \left( 1 - \frac{L^{j,m}}{S^{j,m}_{i_1 \ldots i_d}} \right)_+ \bar\beta^{j,m}_{i_1 \ldots i_d}, \qquad (7)

where L^{j,m} is a resolution- and orientation-dependent shrinkage level. The James-Stein estimator has been previously used to shrink/estimate the detail wavelet coefficients in soft thresholding (Donoho and Johnstone, 1995) and block thresholding for one-dimensional data (Cai, 1999) and for two-dimensional data (Chaux et al., 2008). In the method introduced in this paper, the cumulative influence statistic S^{j,m}_{i_1 \ldots i_d} accounts for three sources of energy influence, in contrast to soft thresholding, which uses source (a) (the coefficient itself), and Cai's block thresholding, which uses sources (a) and (b) (the intra-scale neighboring coefficients). Chaux et al. (2008) introduce a generalized non-overlapping thresholding method which allows for any non-overlapping block composition. In addition, the existing block-thresholding methods (Cai, 1999; Cai and Zhou, 2009; Chaux et al., 2008) shrink the coefficients in non-overlapping blocks. Portilla et al. (2003) point out that non-overlapping block thresholding leads to noticeable de-noising artifacts at the discontinuities introduced by the block boundaries. In our simulation study, we compare the non-overlapping block thresholding method of Cai and Zhou (2009), extended to multi-dimensional data, to the moving-block thresholding introduced in this paper. By individually shrinking the wavelet coefficients, we will show that the mean squared error improves without additionally altering the signal components.

Wavelet Coefficient Shrinkage Level

The shrinkage rule in equation (7) is a function of a shrinkage level L^{j,m}. We allow for resolution- and orientation-dependent levels to adapt to spatial inhomogeneity and signal sparsity.
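A minimal sketch of the cumulative influence statistic (6) and the James-Stein rule (7), with the shrinkage level L^{j,m} passed in as a plain number, follows (Python with NumPy). The toy subband arrays and the boundary handling (averaging only over neighbours that actually exist) are illustrative assumptions.

```python
import numpy as np

def cumulative_influence(beta_j, beta_j1, idx):
    """Statistic (6): own energy (a), average energy of the 2d same-scale
    neighbours (b), and average energy of the 2^d descendants (c)."""
    d = beta_j.ndim
    s1 = beta_j[idx] ** 2
    neigh = []
    for axis in range(d):                   # source (b): same scale/orientation
        for step in (-1, 1):
            j = list(idx)
            j[axis] += step
            if 0 <= j[axis] < beta_j.shape[axis]:
                neigh.append(beta_j[tuple(j)] ** 2)
    s2 = np.mean(neigh)
    # source (c): the 2^d children of idx at resolution j+1
    children = beta_j1[tuple(slice(2 * i, 2 * i + 2) for i in idx)]
    s3 = np.mean(children ** 2)
    return s1 + s2 + s3

def js_shrink(beta, S, level):
    """Rule (7): scale the coefficient by the positive part of (1 - L/S)."""
    return max(1.0 - level / S, 0.0) * beta

beta_j = np.arange(27, dtype=float).reshape(3, 3, 3)   # toy subband at level j
beta_j1 = np.ones((6, 6, 6))                           # toy descendant subband
S = cumulative_influence(beta_j, beta_j1, (1, 1, 1))
shrunk = js_shrink(beta_j[1, 1, 1], S, level=100.0)
```

Because S pools energy from the block of influence, an isolated noise spike (large coefficient, quiet neighbourhood) is shrunk aggressively, while a coefficient inside a genuine component survives.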
The shrinkage levels L^{j,m} are evaluated assuming independent multivariate estimation problems across resolution levels and orientations, similar to current wavelet-based noise reduction methods which allow for resolution-varying thresholding levels (see Jansen, 2001 for a review). In this context, L^{j,m} is a smoothing parameter for resolution level j and orientation m, controlling the trade-off between fitting and smoothing of the wavelet coefficients, where smoothing is measured using the L_2 norm under the shrinkage rule defined in (7). We will probably not discover many
components when the L^{j,m}'s are large. As we decrease L^{j,m}, we can potentially discover more components at the price of less noise smoothing. Consequently, obtaining L^{j,m} is important to maximize the number of true discoveries with the inclusion of only a small number of false positives and without distorting the shape of the components.

The shrinkage level L^{j,m} acts as a significance threshold for the cumulative influence statistics S^{j,m}_{i_1, \ldots, i_d} at resolution level j and orientation m. That is, when S^{j,m}_{i_1, \ldots, i_d} > L^{j,m}, the cumulative influence statistic is assigned to be significant. To evaluate L^{j,m}, we assume the statistics S^{j,m}_{i_1, \ldots, i_d} = X^{j,m}(s_{i_1 \ldots i_d}) are observed from a d-dimensional random field X^{j,m}(s); because the average influence measures are computed based on moving blocks, there is an underlying known spatial dependence between the S^{j,m}_{i_1, \ldots, i_d}. That is, we observe the random field X^{j,m}(s) over a regular grid of points within some closed set S^j, with mean E(X^{j,m}(s)) = \mu^{j,m}(s) and covariance surface C\{X^{j,m}(s), X^{j,m}(s')\} = C^j(s, s'). The covariance surface C^j(s, s') is known and second-order stationary. The covariance structure varies with the resolution level j and is given by

C(S^{j,m}_{i_1, \ldots, i_d}, S^{j,m}_{k_1, \ldots, k_d}) = \begin{cases} \sigma^2/d & \|i - k\|^2 = 1 \\ \sigma^2/(2d^2) & \|i - k\|^2 = 2 \\ \sigma^2/(4d^2) & \|i - k\|^2 = 4 \end{cases}

where \|i - k\|^2 = (i_1 - k_1)^2 + \ldots + (i_d - k_d)^2.

Methodological and theoretical statistical research for identifying global significance thresholds for random fields has been applied to imaging and astrophysics. A few representative references are Siegmund and Worsley (1995); Cao and Worsley (1999); Hopkins et al. (2001); and Pacifico et al. (2004).

Shrinkage level via hypothesis testing. Similar to Pacifico et al. (2004), we determine the shrinkage level for \bar\beta^{j,m}_{i_1 \ldots i_d}, or equivalently the significance threshold for S^{j,m}_{i_1 \ldots i_d}, using hypothesis testing. The hypothesis test is

H_0(X) : \mu^{j,m}(s) = \mu^{j,m}_0 \quad \text{vs} \quad H_A(X) : \mu^{j,m}(s) > \mu^{j,m}_0.

Denote by S^{j,m}_0 = \{s \in S^j : \mu^{j,m}(s) = \mu^{j,m}_0\} the null set.
The shrinkage level is obtained by controlling the false discovery rate

\Gamma^{j,m}(L) = \frac{\lambda(R^{j,m}_L \cap S^{j,m}_0)}{\lambda(R^{j,m}_L)}
where R^{j,m}_L = \{s \in S^j : X^{j,m}(s) > L\} (the alternative set) and \lambda is a probability measure. Consequently, select L^{j,m} such that

L^{j,m} = \operatorname{arginf}_L \{L : \Gamma^{j,m}(L) \leq \epsilon\}, \qquad (8)

where \epsilon is a tolerance level for the error rate. However, in the formulation above, we do not have the null set S^{j,m}_0, and therefore we derive the shrinkage level by controlling an estimated error rate \hat\Gamma^{j,m}(L), which is obtained by replacing the null set S^{j,m}_0 with an estimated superset.

Estimation of the shrinkage level. Following Pacifico et al. (2004), we estimate a superset of S^{j,m}_0, called U^{j,m}, by testing, for all subsets A \subseteq S^j in the sample space,

H_{0,j,m} : A \subseteq S^{j,m}_0 \quad \text{vs} \quad H_{A,j,m} : A \not\subseteq S^{j,m}_0.

Based on this testing procedure, the superset U^{j,m} is the union of all subsets A for which the null hypothesis H_{0,j,m} is not rejected at the significance level \alpha. Because the signal information at a resolution level j is partitioned across 2^d - 1 orientations, we need to correct for multiplicity in simultaneous inference within resolution level j. It follows that

Pr\{U^{j,m} \supseteq S^{j,m}_0, \; m = 1, \ldots, 2^d - 1\} \geq 1 - \alpha, \quad j = J_0, \ldots, J - 1. \qquad (9)

In this paper, we obtain the probability values by simulating from the null distribution, assuming that the wavelet coefficients are normally distributed with a known dependence structure. Using the estimated superset U^{j,m}, the estimated false discovery rate becomes

\hat\Gamma^{j,m}(L) = \frac{\lambda(R^{j,m}_L \cap U^{j,m})}{\lambda(R^{j,m}_L)}.

We therefore obtain the shrinkage level L^{j,m} by controlling \hat\Gamma^{j,m}(L). That is,

L^{j,m} = \operatorname{arginf}_L \{L : \hat\Gamma^{j,m}(L) \leq \epsilon\}.

The condition in (9) and the inequality \hat\Gamma^{j,m}(L) \geq \Gamma^{j,m}(L) imply that

Pr\{\Gamma^{j,m}(L^{j,m}) \leq \epsilon, \; m = 1, \ldots, 2^d - 1\} \geq 1 - \alpha, \quad j = J_0, \ldots, J - 1. \qquad (10)
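The estimated-FDR criterion above reduces to a simple search over candidate levels. The sketch below (Python with NumPy) represents the estimated null superset U^{j,m} as a boolean mask over grid locations and searches the observed values of the statistic; both of these are illustrative simplifications of the paper's random-field formulation.

```python
import numpy as np

def shrinkage_level(S, null_mask, eps):
    """Return the smallest candidate level L with estimated FDR <= eps,
    where the estimated FDR at L is the fraction of exceedances {S > L}
    that fall inside the null superset (given here by `null_mask`)."""
    S = np.asarray(S, dtype=float)
    null_mask = np.asarray(null_mask, dtype=bool)
    for L in np.sort(np.unique(S)):         # candidate levels, ascending
        exceed = S > L
        n_exceed = exceed.sum()
        if n_exceed == 0:
            return L                        # nothing exceeds: trivially controlled
        if exceed[null_mask].sum() / n_exceed <= eps:
            return L
    return S.max()

# Toy example: three statistics from the null superset, three from signal.
S = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
null_mask = np.array([True, True, True, False, False, False])
L = shrinkage_level(S, null_mask, eps=0.1)
```

Raising eps lowers the selected level, admitting more coefficients (and more false positives), which mirrors the trade-off discussed for L^{j,m} above.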
In all our examples in the next sections, we used \alpha = 0.05 and \epsilon = 0.1. That is, we allow for 10% false-positive coefficients at a confidence level of 0.95. The number of detectable components is more sensitive to the confidence level than to the tolerance level \epsilon.

Interpretation of the Error Rate. The hypothesis testing procedure discussed above is used to identify an optimal shrinkage level L^{j,m} for the wavelet coefficients at resolution j and orientation m. We interpret the error rate criterion in (10) as the proportion of false discoveries, where a discovery is a significant coefficient defined by the alternative hypothesis H_A(X); this proportion is smaller than \epsilon with probability 1 - \alpha within a fixed resolution level j. The choice of \epsilon depends on the problem at hand. For example, when \epsilon is close to zero, we allow for a small proportion of false positives but we may also fail to identify components close to the noise level.

3 Simulation Studies

In this simulation study, we compare the multi-dimensional de-noising method introduced in this paper to three common thresholding methods: hard thresholding (Donoho and Johnstone, 1994), soft thresholding (Donoho and Johnstone, 1995) and non-overlapping block thresholding (Cai and Zhou, 2009). For multi-dimensional wavelet de-noising, varying the block size as suggested by Cai and Zhou (2009) requires solving a d-dimensional optimization problem. Because of this difficulty, in this simulation study, the block size is fixed and approximately equal to the log of the number of grid points (log(M_k)) for each dimension k = 1, \ldots, d, as suggested by Cai (1999) and Cai and Silverman (2001) for one-dimensional data. For all three comparison methods, the shrinkage levels vary with both resolution and orientation and are selected to minimize the James-Stein risk estimator.
We compare these four methods not only by evaluating the mean squared errors but also by investigating how well they perform in terms of component identification enhancement.

Simulation Settings. To evaluate these methods in the context of our application, we simulate data following the general model for multi-dimensional NMR data described in (1). We simulate data in three and four dimensions following this model, where the function s is assumed to be a Lorentzian, the shape function commonly assumed in biomolecular NMR data analysis (Hoch and Stern, 1996). Therefore, in
our simulation study, we assume that the mixture regression function in (1) is

f(x_1, \ldots, x_d) = \sum_{l=1}^{L} \frac{A_l / \prod_{s=1}^{d} \tau_{sl}}{\prod_{s=1}^{d} \left( (x_s - \omega_{sl})^2/\tau_{sl}^2 + 1 \right)}. \qquad (11)

The parameters of the simulation model are as follows. The amplitudes A_l, l = 1, \ldots, L vary in the interval [10, 100] and the noise standard error varies: \sigma = 10, 15 and 20. In this example, the number of Lorentzian components is L = 500 on a grid of points for d = 3. We simulate the error term from a multivariate normal with local dependency provided by an autoregressive process of lag four. For each noise level \sigma, we compare the four methods based on one simulated data set.

Separation of the insignificant and significant wavelet coefficients. In order to motivate the use of neighboring wavelet coefficients in separating significant coefficients from insignificant ones, as well as in their estimation, we investigate the separation between the density functions of the insignificant (corresponding to signal-free locations on the boundaries) and significant wavelet coefficients with respect to two statistics: the coefficient magnitudes and the cumulative energetic influence defined in (6). Figure 1 shows the log-scale density functions for one orientation only and for the wavelet coefficients at the finest resolution. For the magnitude statistic, the density functions overlap considerably, whereas for the cumulative energetic influence statistic there is a clear separation: the wider the separation, the more effective the shrinkage rule is. It is important to note that the density separation for the energetic influence statistic improves for lower noise levels and at higher dimensionality, as more information is incorporated in the cumulative energetic influence statistic. Therefore, the significance test discussed in Section 2.1 will improve in power and the error rate will be less conservative for high dimensionality.

Mean squared error (MSE).
We first compare the MSE for the method introduced in this paper and for the three comparative thresholding methods. In this simulation study, moving-block thresholding with ɛ and α varying in [0.01, 0.1] outperforms all three comparative methods in terms of MSE. The results reported in Table 1 are averages over 10 repeated simulations. We have also computed the mean squared error for other block compositions. The block replacing the significance source of influence with the coefficients from different orientations (Proposition 1, part 2)
performs similarly to the block defined in this paper. On the other hand, the block including both sources of significance yielded a higher MSE. The reduction in MSE of the proposed block-thresholding method over the best performing comparison method (non-overlapping block thresholding) diminishes at lower signal levels. Therefore, at very low signal-to-noise ratios, the two block-thresholding methods will perform similarly.

Figure 1: Density functions for the energetic influence statistic (left plots) and the magnitude statistic (right plots), separated for significant and insignificant coefficients. Upper plots are for three-dimensional synthetic data and lower plots for four-dimensional synthetic data.

Table 1: Mean squared errors for the noised data (first column) and for de-noised data using hard-thresholding, soft-thresholding, non-overlapping block thresholding and the moving-block method introduced in this paper (ɛ = .01 and α = .05), for σ = 10, 15 and 20.
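As a rough illustration of this simulation setup, the sketch below evaluates the Lorentzian mixture of Eq. (11) on a small three-dimensional grid, adds Gaussian noise (the paper's noise is AR(4)-correlated, which we simplify to i.i.d. here), and computes the MSE of naive hard and soft thresholding applied directly to the noisy intensities rather than to wavelet coefficients. All names, grid sizes and parameter values are illustrative, not taken from the paper.

```python
import math
import random

def lorentzian_mixture(x, centers, widths, amps):
    """Evaluate the mixture of Eq. (11): sum over components l of
    A_l / [prod_s tau_sl * prod_s ((x_s - omega_sl)^2 / tau_sl^2 + 1)]."""
    total = 0.0
    for omega, tau, A in zip(centers, widths, amps):
        denom = 1.0
        for t in tau:
            denom *= t                                  # prod_s tau_sl
        for xs, w, t in zip(x, omega, tau):
            denom *= (xs - w) ** 2 / t ** 2 + 1.0       # Lorentzian factor per axis
        total += A / denom
    return total

def hard_threshold(v, t):
    """Keep values with |v| > t, zero the rest."""
    return v if abs(v) > t else 0.0

def soft_threshold(v, t):
    """Shrink values toward zero by t, preserving sign."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Illustrative setup: one Lorentzian component on an 8 x 8 x 8 grid (d = 3).
centers, widths, amps = [(4, 4, 4)], [(1.5, 1.5, 1.5)], [50.0]
pts = [(i, j, k) for i in range(8) for j in range(8) for k in range(8)]
truth = [lorentzian_mixture(p, centers, widths, amps) for p in pts]
rng = random.Random(0)
noisy = [f + rng.gauss(0.0, 2.0) for f in truth]

t = 4.0  # illustrative threshold
mse_hard = mse([hard_threshold(v, t) for v in noisy], truth)
mse_soft = mse([soft_threshold(v, t) for v in noisy], truth)
```

In the paper the thresholding is of course applied in the wavelet domain with level- and orientation-dependent thresholds; the sketch only fixes the shape of the two shrinkage rules being compared.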
Number of Components. To highlight the benefits of noise reduction for enhanced estimation of the number of spectral components L, we analyze the performance of a commonly employed component identification method, which detects local maxima above a fixed threshold. A local maximum corresponds to a location (x_{i_1}, \ldots, x_{i_d}) with intensity value Z_{i_1, \ldots, i_d} larger than the immediate neighboring intensities. Hence the local maxima identified with this method are initial estimates for the spectral components. This component identification method has been implemented in most of the existing commercial and non-commercial software packages for NMR data analysis (Gronwald and Kalbitzer, 2004; Güntert, 2003). We apply this component identification method complemented by a test for component identifiability introduced in Serban (2007). The noise level for the simulated data is σ = 15. Since for simulated data the locations of the spectral components (w_l, l = 1, \ldots, L) are known, we evaluate the performance of the component identification method using the false discovery rate (FDR) and the false negative rate (FNR). The FDR at a threshold T is computed as the number of false positives (local maxima which do not correspond to locations of true spectral components) divided by the total number of local maxima discovered up to the threshold T. The FNR at a threshold T is computed as the number of false negatives (undetected true spectral components) divided by the sum of false negatives and true positives up to the threshold T. In Figure 2, the FDR (left plot) and FNR (right plot) are compared for simulated data without de-noising (black solid line) and with de-noising (colored solid lines). On one hand, the FDR computed for noisy data is large whereas the FNR is zero. This is the extreme case, since a large number of false positives are introduced in order to detect all the spectral components.
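A minimal sketch of this local-maxima detector and of the FDR/FNR computation follows; the helper names are ours, and the paper's implementation additionally applies the identifiability test of Serban (2007), which we omit.

```python
def local_maxima(z, T):
    """Grid locations whose intensity exceeds T and all 26 immediate
    neighbors; z maps (i, j, k) -> intensity."""
    offsets = [(di, dj, dk)
               for di in (-1, 0, 1) for dj in (-1, 0, 1) for dk in (-1, 0, 1)
               if (di, dj, dk) != (0, 0, 0)]
    peaks = []
    for (i, j, k), v in z.items():
        if v > T and all(v > z.get((i + di, j + dj, k + dk), float("-inf"))
                         for di, dj, dk in offsets):
            peaks.append((i, j, k))
    return peaks

def fdr_fnr(detected, true_locs):
    """FDR: false peaks over all detected peaks.
    FNR: missed true peaks over all true peaks."""
    detected, true_locs = set(detected), set(true_locs)
    fp = len(detected - true_locs)   # detected but not a true component
    fn = len(true_locs - detected)   # true component not detected
    tp = len(detected & true_locs)
    fdr = fp / len(detected) if detected else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fdr, fnr
```

In practice detected peaks are matched to true component locations within a frequency tolerance rather than by exact equality; exact matching keeps the sketch short.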
On the other hand, the four de-noising methods perform differently with respect to FDR and FNR. For example, the soft-thresholding method has a high FNR over all thresholds and hard thresholding has a high FDR. The two block-thresholding methods perform similarly in terms of FDR, but the block-thresholding method in this paper has a lower FNR at higher thresholds, implying less component shape alteration or smoothing. This finding is supported by a lower estimation error for the amplitude parameters when the block-thresholding method introduced in this paper is employed, as presented in the next result. Parameter Estimation. One primary objective of the analysis of NMR biomolecular data is to obtain estimates of the frequency parameters (w_l, l =
1, \ldots, L) and the amplitude parameters (A_l, l = 1, \ldots, L). Since de-noising methods not only reduce the noise in the data but also smooth out signal components, the parameter estimates, more specifically the amplitudes, may be altered. To evaluate whether the amplitude estimates are altered after de-noising, we applied an initial estimation method to de-noised data. We compare the amplitude estimates for the two block de-noising methods (non-overlapping block thresholding with SURE shrinkage levels and moving-block thresholding). The amplitude error estimates are presented in Figure 3. The smaller error rates for the moving-block thresholding method imply less altered amplitude estimates than for non-overlapping block thresholding.

Figure 2: False Discovery Rate (left) and False Negative Rate (right) resulting from component identification applied to noisy data and to de-noised data using four different methods: hard-thresholding (blue), soft-thresholding (red), non-overlapping block thresholding (green) and moving-block thresholding (purple).

Component Identification in the Wavelet Domain. Lastly, we evaluate the Lipschitz exponent estimator introduced by Pizurica et al. (2002), which measures the local signal regularity characterized by the decay of the wavelet transform amplitude across resolution levels. The Lipschitz exponent estimator can be used to identify regression components for the mixture regression model in (1). An exponent larger than 1 corresponds to a significant regression component whereas an exponent close to zero corresponds to a noise component. To investigate the benefits of de-noising in component identification, we plot the Lipschitz exponent estimates for insignificant and significant wavelet coefficients.
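The decay-rate idea behind such an estimator can be sketched as a least-squares slope of log2 coefficient magnitude against resolution level; this is only the decay part, and it omits the dimension-dependent normalization used by Pizurica et al. (2002), so treat it as a simplified stand-in rather than their estimator.

```python
import math

def decay_slope(mags):
    """Least-squares slope of log2|w_j| against resolution level j = 0..J-1.
    Magnitudes that grow geometrically across levels (large positive slope)
    indicate a regular, signal-like local feature; flat or erratic decay
    indicates noise."""
    xs = range(len(mags))
    ys = [math.log2(m) for m in mags]
    n = len(mags)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den
```

For example, magnitudes doubling at each level give slope 1, consistent with the rule of thumb that exponents above 1 flag significant components.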
Furthermore, we compare the density functions of the estimated exponents for the de-noised and noised wavelet coefficients, for both three-dimensional and four-dimensional simulated data. The
separation improves significantly for the de-noised coefficients, as shown in Figure 4. This supports the use of wavelet-based noise reduction as a preliminary step in component identification.

Figure 3: Amplitude error estimates for simulated data: blue point-down triangles are error estimates after de-noising using the block-thresholding method in this paper and black point-up triangles are error estimates after de-noising using non-overlapping block thresholding with SURE shrinkage.

4 3D NMR Frequency Data

The experimental data explored in this study are for a doubly-labeled sample of a 130-residue RNA binding protein - rho130 - using standard triple resonance experiments on a 1 mM protein sample at a proton frequency of 600 MHz, as introduced in Briercheck et al. (1998). The data were processed with FELIX (Accelrys Software Inc.) using apodization and linear prediction methods that are typical for these types of experiments. The two NMR experiments which generated the data analyzed in this paper are HNCOCA and HNCA. HNCOCA is a three-dimensional experiment in which each spectral component arises due to correlations between the amide nitrogen and amide proton of a specific residue and the alpha carbon of the preceding residue in the protein sequence. Therefore, the true number of spectral components will be slightly larger than the number of protein residues (130 residues for the protein under study). HNCA is also a three-dimensional experiment in which spectral components are paired
with similar amide nitrogen and amide proton frequencies. In HNCA, a pair of spectral components arises due to correlations between the amide nitrogen, the amide proton and the alpha carbon nuclei of the preceding residue (as in HNCOCA) and of the intra-residue. Therefore, for this experiment, the true number of spectral components will be slightly larger than twice the number of protein residues, and half of the spectral components will match the spectral components in HNCOCA. We graphically describe the nuclei correlations in the two experiments in Figure 5.

Figure 4: Density functions for the Lipschitz exponent estimates of the noised wavelet coefficients (left plots) and of the de-noised wavelet coefficients (right plots), separated for significant and insignificant coefficients. Upper plots are for three-dimensional synthetic data and lower plots for four-dimensional synthetic data.

Number of Spectral Components. For the HNCA and HNCOCA datasets, we evaluate whether wavelet-based de-noising enhances estimation of the number of components and whether the method for wavelet coefficient shrinkage introduced in this paper outperforms other existing methods. For this, we apply the local maxima identification method discussed in Section 3 and the method described in this paper. We identify local maxima above a fixed threshold - T = 8000 for HNCOCA data and T = for HNCA data. These thresholds are approximately equal to the noise level estimated as \hat{\sigma}\sqrt{2\log(M_1 M_2 M_3)}, where \hat{\sigma} is the mean absolute variance estimator. Because the HNCA and HNCOCA NMR data are for the same protein, we can
evaluate which spectral components are true positives and which are false positives or false negatives by comparing the locations of the identified local maxima in the two datasets. Half of the local maxima identified for HNCA data should have locations similar to those of the local maxima identified for HNCOCA data. Moreover, the local maxima identified for HNCA data should pair with respect to the amide nitrogen (first dimension) and amide proton (third dimension) frequencies. We also know the number of protein residues (130), and therefore we know that for the HNCOCA and HNCA NMR data we need to detect slightly more than 130 and 260 spectral components, respectively.

Figure 5: Description of the nuclei correlations in the HNCA and HNCOCA NMR experiments.

In Table 2, we present the results of this comparison. Specifically, we report the number of local maxima identified for noisy and de-noised HNCA and HNCOCA data (first and second rows). We also show the number of paired local maxima identified for HNCA data (third row) and the number of HNCA pairs matching HNCOCA local maxima (fourth row). Not all spectral components in HNCA data are paired because of missing spectral components and because some components are close to the noise level. Lastly, we present the total number of HNCA local maxima matching HNCOCA local maxima, which is an estimate of the detectable HNCA components (fifth row). The last row is an evaluation of the false discovery rate, calculated from the proportion of detectable HNCA-HNCOCA matches to the total number of local maxima. The first observation based on the results reported in Table 2 is that the number of false local maxima is extremely large for noisy HNCA and HNCOCA data. More than half of the local maxima are false positives at the specified threshold. One alternative is to use a higher threshold, but if we increase the threshold, a large number of true
positives will not be detectable. Therefore, for these two NMR experiments, wavelet-based de-noising greatly enhances component identification.

Table 2: The number of local maxima identified for HNCA and HNCOCA data, the number of pairs for HNCA and the number of matches between HNCA and HNCOCA local maxima. Columns: Noisy Data, HARD, SOFT, Non-overlap BLOCK, Moving BLOCK. Rows: number of local maxima for HNCA data; number of local maxima for HNCOCA data; number of HNCA pairs; number of HNCOCA local maxima matching HNCA pairs; number of detectable HNCA components matching HNCOCA components; False Discovery Rate.

A second observation is that the four comparative de-noising methods perform rather differently. The soft-thresholding method over-smooths the signal components, resulting in a small number of local maxima for both HNCA and HNCOCA. On the other hand, hard thresholding under-denoises and/or adds spurious aberrations of Gibbs type, resulting in a larger number of false positives for both HNCA and HNCOCA data at the specified threshold. These observations are consistent with our FDR and FNR comparisons in the simulation study. Importantly, the block de-noising method in this paper allows for the largest number of HNCA-HNCOCA detectable components (fifth row of Table 2) at the lowest false discovery rate. Moreover, the number of HNCA pairs matching HNCOCA local maxima is also the largest for the block method introduced in this paper. Therefore, with moving-block de-noised data, there is a high level of matching between HNCA and HNCOCA components.

De-noising level. We compare the mean squared differences between observed and de-noised data within 64 non-overlapping spatially contiguous regions for the HNCA and HNCOCA data (see Figure 6). For a spatially adaptive de-noising method,
we would expect the mean squared residual to be approximately equal across all 64 regions. For hard- and SURE-soft thresholding, the mean squared residual varies greatly, with large spikes in regions where most of the signal components are located. These findings confirm our assessment of the detectable number of components. Although both hard and soft thresholding over-smooth the spectral components, only soft-thresholding results in a smaller number of local maxima; in contrast, hard thresholding leads to the largest number of detectable local maxima (above the specified threshold). One reason for this difference is that hard thresholding adds spurious aberrations of Gibbs type, resulting in a larger number of false positives for both HNCA and HNCOCA data at the specified threshold. For the block-shrinkage method in this paper, the mean squared residual varies less across the 64 regions than for all three comparative de-noising methods. Moreover, the mean squared differences are smaller for the block-thresholding method throughout all regions. These results indicate that the method introduced in this paper is more conservative in the sense that the signal-free areas will be less de-noised but the signal components will preserve their shape and amplitude. This is an important aspect of the noise reduction method, since the estimation of the spectral parameters will be more accurate.

Overall de-noising. Next we compare the HNCA (FT-transformed) data with its de-noised version using the moving-block thresholding method in Figure 7. The (FT) NMR HNCA data are very noisy with varying signal-to-noise ratio. The signal is mainly concentrated at lower values of the z axis. Ideally, we would smooth out the noise without altering the signal components. The block de-noising method clears the noise reasonably well.
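The per-region diagnostic described above (mean squared differences within 64 spatially contiguous regions) can be sketched as follows; the 64 regions correspond to a 4 x 4 x 4 partition of the grid, and the function names are illustrative.

```python
def region_msr(observed, denoised, n, blocks=4):
    """Mean squared residual (observed - denoised) within each of
    blocks^3 contiguous cubic regions of an n x n x n grid.
    observed and denoised map (i, j, k) -> intensity."""
    size = max(n // blocks, 1)
    sums, counts = {}, {}
    for (i, j, k), v in observed.items():
        # Map the grid point to its cubic region; clamp the edge so all
        # points fall inside the blocks^3 partition even if n % blocks != 0.
        key = (min(i // size, blocks - 1),
               min(j // size, blocks - 1),
               min(k // size, blocks - 1))
        r = v - denoised[(i, j, k)]
        sums[key] = sums.get(key, 0.0) + r * r
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

For a spatially adaptive method the returned values should be roughly flat across regions; large spikes flag regions where signal components were smoothed away rather than noise removed.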
There is a trade-off between reducing the noise in signal-free areas and the estimation accuracy of the number of spectral components and of their parameters. With the block method presented in this paper, we can control this trade-off by varying the level of the false discovery rate (ɛ) and its corresponding probability (α). For these results, we used α = .05 and ɛ = .01. In our case study, the primary objective of the noise reduction is component identification, and therefore it is important to control the error level ɛ at a small value. The choice of α in the range [.05, .15] does not change the results significantly.
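The role of ɛ as an FDR level can be illustrated with the generic Benjamini-Hochberg step-up rule applied to per-coefficient p-values; the paper's actual rule couples FDR control with SURE shrinkage levels that vary with resolution and orientation, which this sketch does not reproduce.

```python
def bh_select(pvalues, eps):
    """Benjamini-Hochberg step-up at FDR level eps: return the indices
    of the coefficients declared significant."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up condition: p_(rank) <= eps * rank / m.
        if pvalues[idx] <= eps * rank / m:
            k = rank  # largest rank passing the condition
    return sorted(order[:k])
```

Lowering ɛ declares fewer coefficients significant, which de-noises signal-free areas more aggressively at the risk of smoothing weak components, matching the trade-off described above.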
Figure 6: Mean squared differences (MSD) between observed and de-noised data within 64 non-overlapping regions for the HNCA and HNCOCA frequency data.

Figure 7: HNCA NMR frequency data: observed (left) and de-noised using the moving-block method in this paper (right).
high-dimensional inference robust to the lack of model sparsity Jelena Bradic (joint with a PhD student Yinchu Zhu) www.jelenabradic.net Assistant Professor Department of Mathematics University of California,
More informationNoise & Data Reduction
Noise & Data Reduction Paired Sample t Test Data Transformation - Overview From Covariance Matrix to PCA and Dimension Reduction Fourier Analysis - Spectrum Dimension Reduction 1 Remember: Central Limit
More informationELEMENTS OF PROBABILITY THEORY
ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationRegression Clustering
Regression Clustering In regression clustering, we assume a model of the form y = f g (x, θ g ) + ɛ g for observations y and x in the g th group. Usually, of course, we assume linear models of the form
More informationStatistics 910, #5 1. Regression Methods
Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known
More information[y i α βx i ] 2 (2) Q = i=1
Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation
More informationThe lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding
Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty
More informationRegression Shrinkage and Selection via the Lasso
Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,
More informationEstimation of large dimensional sparse covariance matrices
Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)
More informationDenoising via Recursive Wavelet Thresholding. Alyson Kerry Fletcher. A thesis submitted in partial satisfaction of the requirements for the degree of
Denoising via Recursive Wavelet Thresholding by Alyson Kerry Fletcher A thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Electrical Engineering in the
More informationA Data-Driven Block Thresholding Approach To Wavelet Estimation
A Data-Driven Block Thresholding Approach To Wavelet Estimation T. Tony Cai 1 and Harrison H. Zhou University of Pennsylvania and Yale University Abstract A data-driven block thresholding procedure for
More informationDetection of structural breaks in multivariate time series
Detection of structural breaks in multivariate time series Holger Dette, Ruhr-Universität Bochum Philip Preuß, Ruhr-Universität Bochum Ruprecht Puchstein, Ruhr-Universität Bochum January 14, 2014 Outline
More informationState-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53
State-space Model Eduardo Rossi University of Pavia November 2014 Rossi State-space Model Fin. Econometrics - 2014 1 / 53 Outline 1 Motivation 2 Introduction 3 The Kalman filter 4 Forecast errors 5 State
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationMinimum Hellinger Distance Estimation in a. Semiparametric Mixture Model
Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.
More informationBasic principles of multidimensional NMR in solution
Basic principles of multidimensional NMR in solution 19.03.2008 The program 2/93 General aspects Basic principles Parameters in NMR spectroscopy Multidimensional NMR-spectroscopy Protein structures NMR-spectra
More informationLecture Hilbert-Huang Transform. An examination of Fourier Analysis. Existing non-stationary data handling method
Lecture 12-13 Hilbert-Huang Transform Background: An examination of Fourier Analysis Existing non-stationary data handling method Instantaneous frequency Intrinsic mode functions(imf) Empirical mode decomposition(emd)
More information1 Least Squares Estimation - multiple regression.
Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1
More informationSparsity in Underdetermined Systems
Sparsity in Underdetermined Systems Department of Statistics Stanford University August 19, 2005 Classical Linear Regression Problem X n y p n 1 > Given predictors and response, y Xβ ε = + ε N( 0, σ 2
More informationMachine Learning Linear Regression. Prof. Matteo Matteucci
Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares
More informationCurvelet imaging & processing: sparseness constrained least-squares migration
Curvelet imaging & processing: sparseness constrained least-squares migration Felix J. Herrmann and Peyman P. Moghaddam (EOS-UBC) felix@eos.ubc.ca & www.eos.ubc.ca/~felix thanks to: Gilles, Peyman and
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationMODWT Based Time Scale Decomposition Analysis. of BSE and NSE Indexes Financial Time Series
Int. Journal of Math. Analysis, Vol. 5, 211, no. 27, 1343-1352 MODWT Based Time Scale Decomposition Analysis of BSE and NSE Indexes Financial Time Series Anu Kumar 1* and Loesh K. Joshi 2 Department of
More informationSparse linear models
Sparse linear models Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 2/22/2016 Introduction Linear transforms Frequency representation Short-time
More informationDenosing Using Wavelets and Projections onto the l 1 -Ball
1 Denosing Using Wavelets and Projections onto the l 1 -Ball October 6, 2014 A. Enis Cetin, M. Tofighi Dept. of Electrical and Electronic Engineering, Bilkent University, Ankara, Turkey cetin@bilkent.edu.tr,
More informationLow-rank Promoting Transformations and Tensor Interpolation - Applications to Seismic Data Denoising
Low-rank Promoting Transformations and Tensor Interpolation - Applications to Seismic Data Denoising Curt Da Silva and Felix J. Herrmann 2 Dept. of Mathematics 2 Dept. of Earth and Ocean Sciences, University
More informationAccelerated Block-Coordinate Relaxation for Regularized Optimization
Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth
More informationDetection theory. H 0 : x[n] = w[n]
Detection Theory Detection theory A the last topic of the course, we will briefly consider detection theory. The methods are based on estimation theory and attempt to answer questions such as Is a signal
More informationEfficient Algorithms for Pulse Parameter Estimation, Pulse Peak Localization And Pileup Reduction in Gamma Ray Spectroscopy M.W.Raad 1, L.
Efficient Algorithms for Pulse Parameter Estimation, Pulse Peak Localization And Pileup Reduction in Gamma Ray Spectroscopy M.W.Raad 1, L. Cheded 2 1 Computer Engineering Department, 2 Systems Engineering
More informationA Multiple Testing Approach to the Regularisation of Large Sample Correlation Matrices
A Multiple Testing Approach to the Regularisation of Large Sample Correlation Matrices Natalia Bailey 1 M. Hashem Pesaran 2 L. Vanessa Smith 3 1 Department of Econometrics & Business Statistics, Monash
More informationWooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics
Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).
More informationFDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES
FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES Sanat K. Sarkar a a Department of Statistics, Temple University, Speakman Hall (006-00), Philadelphia, PA 19122, USA Abstract The concept
More informationAbstract. 1 Introduction. Cointerpretation of Flow Rate-Pressure-Temperature Data from Permanent Downhole Gauges. Deconvolution. Breakpoint detection
Cointerpretation of Flow Rate-Pressure-Temperature Data from Permanent Downhole Gauges CS 229 Course Final Report Chuan Tian chuant@stanford.edu Yue Li yuel@stanford.edu Abstract This report documents
More informationLineShapeKin NMR Line Shape Analysis Software for Studies of Protein-Ligand Interaction Kinetics
LineShapeKin NMR Line Shape Analysis Software for Studies of Protein-Ligand Interaction Kinetics http://lineshapekin.net Spectral intensity Evgenii L. Kovrigin Department of Biochemistry, Medical College
More informationCross-Validation with Confidence
Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation
More informationCOMPLEX WAVELET TRANSFORM IN SIGNAL AND IMAGE ANALYSIS
COMPLEX WAVELET TRANSFORM IN SIGNAL AND IMAGE ANALYSIS MUSOKO VICTOR, PROCHÁZKA ALEŠ Institute of Chemical Technology, Department of Computing and Control Engineering Technická 905, 66 8 Prague 6, Cech
More informationPenalty Methods for Bivariate Smoothing and Chicago Land Values
Penalty Methods for Bivariate Smoothing and Chicago Land Values Roger Koenker University of Illinois, Urbana-Champaign Ivan Mizera University of Alberta, Edmonton Northwestern University: October 2001
More informationProf. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis
Introduction to Time Series Analysis 1 Contents: I. Basics of Time Series Analysis... 4 I.1 Stationarity... 5 I.2 Autocorrelation Function... 9 I.3 Partial Autocorrelation Function (PACF)... 14 I.4 Transformation
More information