A Statistical Test for Mixture Detection with Application to Component Identification in Multi-dimensional Biomolecular NMR Studies


Nicoleta Serban (corresponding author)
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
nserban@isye.gatech.edu

Pengfei Li
Department of Statistics and Actuarial Science, University of Waterloo
pengfei.li@uwaterloo.ca

In this paper, we introduce a statistical hypothesis test for detecting mixtures in a regression model described by a regression function which is a weighted sum of multi-dimensional unimodal functions. Two regression components are mixed when the distance between their centers is small or when the proportion of their contribution to the mixture is close to zero or one. Two challenges in model estimation under the null hypothesis of one regression component are that the mixing proportion lies on the boundary of the parameter space and that the parameters are non-identifiable. Therefore, the parameter estimators derived from standard nonlinear estimation approaches are inconsistent and unstable. To overcome these challenges, we study a penalized regression test statistic with a relatively simple quadratic approximation which can be used to simulate the quantiles of the test statistic under the null hypothesis. We also show that the parameter estimators obtained with the penalized regression approach in this paper are consistent. The motivating application is the detection of mixed components in multi-dimensional biomolecular Nuclear Magnetic Resonance (NMR) data.

Keywords: biomolecular NMR, density mixture, mixture test, multi-dimensional mixture regression, likelihood ratio test.

1 Introduction

In this paper, we focus on a specific nonlinear regression model described by

Z_{i_1,\dots,i_d} = \sum_{l=1}^{L} A_l \, s(x_{i_1},\dots,x_{i_d};\,\omega_l,\tau_l) + \sigma\epsilon_{i_1,\dots,i_d}, \quad i_1 = 1,\dots,M_1, \ \dots, \ i_d = 1,\dots,M_d, \qquad (1)

where the multi-dimensional regression component A_l s(x_{i_1},\dots,x_{i_d};\,\omega_l,\tau_l) is a parametric function uniquely identified by a set of parameters describing the amplitude (A_l), the width (\tau_l = (\tau_{l1},\dots,\tau_{ld})^T) and the location (\omega_l = (\omega_{l1},\dots,\omega_{ld})^T) of the regression component. The regression function is nonlinear in the location and width parameters. The shape function s is assumed to be known, symmetric and unimodal. The d-dimensional design points (x_{i_1},\dots,x_{i_d}) are fixed and equally spaced and observed over a compact set. For brevity, we assume (x_{i_1},\dots,x_{i_d}) \in [0,1]^d. It follows that the location parameters \omega_l, l = 1,\dots,L, also fall within the same compact set [0,1]^d. The error terms \epsilon_{i_1,\dots,i_d} are commonly assumed to be independent and identically distributed. The observations Z_{i_1,\dots,i_d} are intensity values.

Each regression component in model (1) is revealed as a local maximum in the observed intensity data Z_{i_1,\dots,i_d} when the distance between components is large and when the signal-to-noise level is high (the amplitudes are well away from the noise level). That is, under these assumptions, there is a one-to-one mapping between the regression components and the local maxima in the data. However, in many applications, such as the motivating NMR case study, the distance between components is small, and therefore there will be mixed regression components that overlap into one local maximum. In Figure 1, we show an example of two-dimensional overlapped regression components; because the distance between components is small, the two components merge into one local maximum. It is this type of mixed components that is of interest in this paper.

Figure 1: An example of two-dimensional overlapped regression components: contour and perspective plots.

To identify statistically significant mixed components, we introduce a hypothesis test for mixtures in the regression model described in (1). In practice, application of the mixture test involves two steps:

1. Apply a local maxima identification method (Serban, 2007, 2010, and the references therein); and

2. Apply the testing procedure in this paper to the local intensity data of each local maximum to decide whether it maps to one or more regression components.

In NMR data analysis, the evaluation of local maxima for the detection of overlapped components is commonly performed manually. The mixture statistical test introduced in this paper will reduce this manual intervention, which is error prone and time consuming.

For illustration of the mixture hypothesis test, we consider a specific parametric function s, the Lorentzian function, commonly used to model Nuclear Magnetic Resonance (NMR) data (see Section 5). However, our theoretical results apply to other symmetric unimodal shape functions as long as the regression components are identifiable.

Under the assumption of a mixture of Lorentzian regression components, the model becomes

Z^{(l)} = \alpha\,\frac{A/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{s1})^2\tau_s^{-2}+1\}} + (1-\alpha)\,\frac{A/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{s2})^2\tau_s^{-2}+1\}} + \sigma\epsilon, \qquad (2)

where Z^{(l)} are observations from a local region of the complete data Z_{i_1,\dots,i_d}. For simplicity of notation, we will refer to Z^{(l)} as Z_{i_1,\dots,i_d}, but we note that we apply the hypothesis test locally, since the full data will generally consist of a large number of regression components. We do not need to correct for multiplicity in testing for two components across multiple regions of the complete data since the testing is not simultaneous.

In model (2), the amplitude parameters A_1 and A_2 of the two components are uniquely defined by the parameters A and \alpha (A = A_1 + A_2 and \alpha = A_1/(A_1 + A_2)). We re-parameterize the model as in (2) to express the null hypothesis with respect to one parameter \alpha rather than two parameters A_1 and A_2. Under this modeling framework, the null hypothesis is

H_0: \{\alpha = 0 \text{ or } \alpha = 1 \text{ or } \omega_1 = \omega_2\},

where \omega_1 = (\omega_{11},\dots,\omega_{d1})^T and \omega_2 = (\omega_{12},\dots,\omega_{d2})^T are the location parameters of the components that are tested for mixture.

For testing the null hypothesis of one regression component, we assume that the width parameters \tau_s and the error variance \sigma^2 are known and fixed. The assumption of fixed \sigma^2 implies that the signal-to-noise ratio is fixed, and therefore the false discovery rate of the regression components is also fixed. At large values of \sigma^2, small amplitude components will be non-detectable from the noise level; that is, \sigma^2 needs to be well below A\max\{\alpha, 1-\alpha\}/\prod_{s=1}^{d}\tau_s. On the other hand, the width parameters change the shape of the regression function. For fixed parameters \alpha and A but large values of \tau_s, the heights of the two components decrease and the tails become fatter, which reduces the identifiability of the model parameters. Therefore, the assumption of fixed widths ensures some level of identifiability of the regression components.
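For concreteness, the following is a minimal NumPy sketch of the d-dimensional Lorentzian shape function and the two-component mean function in model (2); the function names and array layout are our own choices, not part of the paper.

```python
import numpy as np

def lorentzian(x, omega, tau):
    # d-dimensional Lorentzian shape s(x; omega, tau) at design points x.
    # x: (n, d) array; omega and tau: (d,) arrays.
    x = np.atleast_2d(x)
    return 1.0 / np.prod((x - omega) ** 2 / tau ** 2 + 1.0, axis=1)

def mixture_mean(x, A, alpha, omega1, omega2, tau):
    # Mean function of model (2): a weighted sum of two Lorentzian
    # components sharing the total amplitude A and the widths tau.
    scale = A / np.prod(tau)
    return scale * (alpha * lorentzian(x, omega1, tau)
                    + (1.0 - alpha) * lorentzian(x, omega2, tau))
```

In practice, x would be obtained by flattening a meshgrid of the equally spaced design points, so that each row is one d-dimensional grid location.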

The assumption of fixed width parameters and error variance is common practice (Koradi et al., 1998; Malmodin et al., 2003). These parameters are commonly estimated from the complete data (all regression components) before applying the hypothesis testing procedure, as we discuss in Section 3. Estimating these parameters from the complete data rather than locally provides more accurate estimates, since we pool information from multiple regression components.

We highlight here that the model (1) is not a mixture of regressions but a regression model in which the regression function is a sum of weighted components; therefore, the existing research on detecting mixtures of densities (Titterington et al., 1985; Müller and Sawitzki, 1991; Roeder, 1994; Chen and Kalbfleisch, 1996; Walther, 2004; Chen and Li, 2009; Li et al., 2009, and the references therein) does not apply to the statistical framework introduced in this paper.

A widespread test statistic used in detecting density mixtures is the likelihood ratio test (LRT). Similar to the problem of detecting mixtures of densities, the LRT has its limitations under the regression framework discussed in this paper. Under the null hypothesis, the model parameters are non-identifiable, since the null hypothesis depends on two parametric conditions involving \alpha, \omega_1 and \omega_2. Moreover, the (mixing) proportion parameter \alpha is on the boundary of the parameter space, since it takes values 0 or 1. Due to these two irregularities, standard likelihood-based estimation (nonlinear least squares estimation under the normality assumption) provides inconsistent and unstable parameter estimators (Jennrich, 1969).

An alternative approach to the LRT is to place a prior distribution on the parameter \alpha; similarly to Aitkin and Rubin (1985), the mixing proportion becomes a nuisance parameter which is further integrated out.

However, under the regression framework, one regularity condition for the asymptotic distribution of the likelihood ratio test does not hold for the Aitkin and Rubin (1985) approach (Serban, 2005).

Different re-parameterizations of the mixture regression model may also provide identifiable parameters under the null hypothesis. For mixtures of densities, re-parameterization has been investigated under the assumption that the true null parameter is known (Lemdani and Pons, 1999) or that the true parameter is unknown (Dacunha-Castelle and Gassiat, 1997; Liu and Shao, 2003). Other approaches impose constraints on the density means (Ghosh and Sen, 1985) or on the mixing proportion parameters (Chen et al., 2001; Chen and Kalbfleisch, 2005).

In detecting mixtures in the multi-dimensional regression model described in this paper, we proceed with the latter approach; that is, we constrain the mixing proportion parameter \alpha to be away from the boundary values under the null hypothesis. This constraint overcomes both the identifiability problem and the estimation of the mixing proportion on the boundary of the parameter space. The constraint is defined through a penalty function p(\alpha) which attains its maximum at \alpha = 1/2. Under the null hypothesis, the penalty term is 0. To obtain the penalized LR, we therefore need to employ penalized nonlinear regression under the null and alternative hypotheses. Consequently, we introduce an estimation approach that uses the idea of an Expectation-Maximization type algorithm (Chen and Li, 2009; Li et al., 2009). In this paper, we show the consistency of the parameter estimates under the proposed estimation approach.

One difficulty in penalized likelihood ratio (PLR) testing for mixture detection is deriving the (asymptotic) distribution of the test statistic under the null hypothesis. For density mixture detection, the test statistic commonly has the asymptotic mixture distribution 0.5\chi_0^2 + 0.5\chi_1^2 (Chen and Kalbfleisch, 1996; Chen et al., 2001; Chen and Li, 2009; Li et al., 2009).

In contrast, in hypothesis testing for mixtures in the regression function, the asymptotic distribution of the PLR test statistic under the null hypothesis does not have a closed-form expression. In this paper we investigate a test statistic derived using an approach similar to the EM-test statistic recommended by Chen and Li (2009) and Li et al. (2009). Because the asymptotic distribution of the EM-test does not have a closed-form expression, we derive a quadratic approximation for the distribution of the EM-test statistic, and we use this distribution to obtain the quantiles of the null distribution of the test statistic using sampling techniques.

We discuss the testing procedure in Section 2, along with asymptotic results in Section 2.2. In Section 3, we describe the application of the mixture test to the identification of overlapped regression components in the more general model for L > 2. In a simulation study in Section 4, we evaluate the accuracy (type I error) and the efficiency (power) of the testing procedure. The statistical application investigated in this paper is pertinent to the study of three-dimensional protein structure determination using NMR. We introduce the NMR biomolecular studies in Section 5 and the application of the proposed testing procedure to two- and three-dimensional NMR data in Section 5.2.

2 Testing Procedure

Define the likelihood function l(\alpha, A, \omega_1, \omega_2) for the regression model in (2). Under the assumption of a normal error distribution, the log-likelihood function is proportional to

-\frac{1}{2\sigma^2} \sum_{i_1=1}^{M_1}\cdots\sum_{i_d=1}^{M_d} \left( Z_{i_1,\dots,i_d} - A\alpha\, s(x_{i_1},\dots,x_{i_d};\omega_1,\tau) - A(1-\alpha)\, s(x_{i_1},\dots,x_{i_d};\omega_2,\tau) \right)^2, \qquad (3)

where \alpha is a weight parameter and must be in the interval [0, 1]. We penalize the log-likelihood function using a penalty p(\alpha), a continuous function that is maximized at 0.5 and goes to negative infinity as \alpha goes to 0 or 1.
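As a concrete reference, here is a small sketch of the log-likelihood (3) and its penalized version (4); it reuses `mixture_mean` from the earlier sketch, and it adopts the lasso-type penalty p(\alpha) = C\log(1 - |1 - 2\alpha|) that the paper selects later, so the penalty choice is ours at this point.

```python
import numpy as np

def log_likelihood(Z, x, alpha, A, omega1, omega2, tau, sigma):
    # Log-likelihood (3), up to an additive constant, under normal errors;
    # mixture_mean is the model-(2) mean function sketched earlier.
    resid = Z - mixture_mean(x, A, alpha, omega1, omega2, tau)
    return -np.sum(resid ** 2) / (2.0 * sigma ** 2)

def penalized_log_likelihood(Z, x, alpha, A, omega1, omega2, tau, sigma, C):
    # Penalized log-likelihood (4) with p(alpha) = C log(1 - |1 - 2 alpha|),
    # which is 0 at alpha = 0.5 and goes to -inf at the boundaries.
    p = C * np.log(1.0 - abs(1.0 - 2.0 * alpha))
    return log_likelihood(Z, x, alpha, A, omega1, omega2, tau, sigma) + p
```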

Without loss of generality, we assume that \alpha \in [0, 0.5]. Define the penalized log-likelihood function as follows:

pl(\alpha, A, \omega_1, \omega_2) = l(\alpha, A, \omega_1, \omega_2) + p(\alpha). \qquad (4)

2.1 EM-test Statistic

In this section, we introduce a test statistic which measures the discrepancy between the null hypothesis H_0 of one regression component and the observed data. The test statistic is a version of the EM-test introduced by Chen and Li (2009) and Li et al. (2009), extended to our regression framework. The primary motivation for using this test statistic is the efficiency of the LRT under the assumption of fixed \alpha = \alpha_0. Since \alpha is not fixed, we instead proceed with an EM-type algorithm in which, at the M step, we assume \alpha is fixed and estimate the other model parameters, and at the E step we update the parameter \alpha; we then repeat the E and M steps to obtain a more suitable mixing proportion, which improves the power. Only a small number of iterations, K, is used. The asymptotic approximation of the null distribution holds under finite K. The procedure to derive the EM-test statistic is described below.

Step 0: Estimate the scaling parameter A and the location parameters \omega_1 = \omega_2 under the null model by maximizing the penalized likelihood function in (4). Denote (\hat{A}_0, \hat{\omega}_0) = \arg\max pl(\alpha = 0.5, A, \omega, \omega).

Step 1: In the next steps, we obtain estimates for the amplitude and location parameters under the alternative hypothesis, given initial values for the mixing proportion. Choose a number of initial values for \alpha: \alpha_1, \dots, \alpha_J in (0, 0.5]. For each \alpha_j value, we obtain estimates for the parameters A, \omega_1, \omega_2 using an iterative algorithm similar in its steps to the EM algorithm. Let \alpha_j^{(0)} = \alpha_j for j = 1, \dots, J, and start with the first iteration k = 1, where k \le K (K is the maximum number of iterations).

Step 1.1. Estimate A, \omega_1, \omega_2 by maximizing the penalized likelihood function while holding the proportion parameter fixed, i.e.,

(A_j^{(k)}, \omega_{j,1}^{(k)}, \omega_{j,2}^{(k)}) = \arg\max_{A,\,\omega_1,\,\omega_2} pl(\alpha_j^{(k-1)}, A, \omega_1, \omega_2).

Step 1.2. Update the mixing proportion parameter \alpha_j^{(k)} by minimizing

g(\alpha) = C_1\alpha^2 - 2C_2\alpha - p(\alpha), \qquad (5)

where

C_1 = \frac{1}{2\sigma^2}\sum_{i_1=1}^{M_1}\cdots\sum_{i_d=1}^{M_d} \{A_j^{(k)} s(x_{i_1},\dots,x_{i_d};\omega_{j,1}^{(k)},\tau) - A_j^{(k)} s(x_{i_1},\dots,x_{i_d};\omega_{j,2}^{(k)},\tau)\}^2,

C_2 = \frac{1}{2\sigma^2}\sum_{i_1=1}^{M_1}\cdots\sum_{i_d=1}^{M_d} \{Z_{i_1,\dots,i_d} - A_j^{(k)} s(x_{i_1},\dots,x_{i_d};\omega_{j,2}^{(k)},\tau)\} \{A_j^{(k)} s(x_{i_1},\dots,x_{i_d};\omega_{j,1}^{(k)},\tau) - A_j^{(k)} s(x_{i_1},\dots,x_{i_d};\omega_{j,2}^{(k)},\tau)\}.

Step 1.3. Let k = k + 1, and iterate Step 1.1 and Step 1.2 if k < K.

Step 2: For j = 1, \dots, J, define the statistic, which depends on the initial value \alpha_j,

M_j^{(K)}(\alpha_j) = 2\{pl(\alpha_j^{(K)}, A_j^{(K)}, \omega_{j,1}^{(K)}, \omega_{j,2}^{(K)}) - pl(0.5, \hat{A}_0, \hat{\omega}_0, \hat{\omega}_0)\}.

The EM-test statistic is defined as

EM^{(K)} = \max\{M_j^{(K)}(\alpha_j),\ j = 1,\dots,J\}.

We reject the null hypothesis if EM^{(K)} is greater than a specified critical value, which is determined from the limiting distribution of the test statistic (see Section 2.2). In our empirical studies, the maximum number of iterations is K = 10; a larger number of iterations does not enhance the efficiency of the testing procedure.
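To make Steps 0-2 concrete, the following is a minimal sketch of the whole procedure built on general-purpose optimizers; it reuses the helpers sketched above. The starting values, the use of scipy.optimize.minimize (BFGS by default), and the bounded one-dimensional search in Step 1.2 (which minimizes g(\alpha) up to an additive constant) are our simplifications, not the paper's estimating equations.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def em_test_statistic(Z, x, tau, sigma, C,
                      alphas=(0.1, 0.2, 0.3, 0.4, 0.5), K=10):
    # EM-test statistic EM^(K) for one local-maximum region (Steps 0-2).
    d = x.shape[1]

    def pl(alpha, A, om1, om2):
        return penalized_log_likelihood(Z, x, alpha, A, om1, om2, tau, sigma, C)

    # Step 0: one-component (null) fit with alpha = 0.5 and omega1 = omega2.
    i0 = np.argmax(Z)                                  # crude starting values
    start = np.concatenate(([Z[i0] * np.prod(tau)], x[i0]))
    fit0 = minimize(lambda par: -pl(0.5, par[0], par[1:], par[1:]), start)
    pl0 = -fit0.fun

    em = -np.inf
    for alpha in alphas:                               # Step 1, per alpha_j
        A, om1, om2 = start[0], x[i0] - 0.01, x[i0] + 0.01
        for _ in range(K):
            # Step 1.1: update (A, omega1, omega2) with alpha held fixed.
            fit = minimize(lambda par: -pl(alpha, par[0], par[1:1 + d],
                                           par[1 + d:]),
                           np.concatenate(([A], om1, om2)))
            A, om1, om2 = fit.x[0], fit.x[1:1 + d], fit.x[1 + d:]
            # Step 1.2: update alpha; -pl equals g(alpha) up to a constant.
            alpha = minimize_scalar(lambda a: -pl(a, A, om1, om2),
                                    bounds=(1e-6, 0.5), method="bounded").x
        # Step 2: penalized likelihood ratio M_j^(K) for this start.
        em = max(em, 2.0 * (pl(alpha, A, om1, om2) - pl0))
    return em
```

In the paper, the Step 1.2 update has a closed form (derived in Appendix A); the bounded scalar search above is a slower but simpler stand-in.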

For the EM-test statistic, we need to specify the initial values of the mixing parameter, the penalty p(\alpha) and the penalizing constant C. In general, we recommend starting with a small number of initial values for the mixing parameter, for example, \alpha_0 \in \{0.1, 0.2, 0.3, 0.4, 0.5\}. Our empirical studies showed that a larger number of initial values for the mixing parameter does not improve the accuracy of the testing procedure.

When selecting the penalty function p(\alpha), we need to consider two important criteria. First, the equation used to update the mixing parameter in (5) does not have a minimum for every penalty function. We show in the Appendix that there exists a unique minimum for two commonly used penalties, p(\alpha) = C\log(1 - |1 - 2\alpha|) (Li et al., 2009) and p(\alpha) = C\log(4\alpha(1-\alpha)) (Chen et al., 2001), for any constant C > 0. Second, the choice of the penalty function impacts the trade-off between the type I error and the power of the test, as well as the accuracy of the approximation under the null hypothesis. The penalty p(\alpha) = C\log(1 - |1 - 2\alpha|) is a lasso-type penalty (Li et al., 2009); it is continuous but not smooth at \alpha = 0.5, the null value. As pointed out by Li et al. (2009) and Chen and Li (2009), this penalty has properties similar to the lasso penalty for linear regression; consequently, the probability that the fitted value of \alpha equals 1/2 is positive. In contrast, the penalty p(\alpha) = C\log(4\alpha(1-\alpha)) is smooth for all \alpha. Following the suggestion in Li et al. (2009), we use p(\alpha) = C\log(1 - |1 - 2\alpha|) in the implementation of the EM-test procedure.

The constant C controls the penalization of the LRT when \alpha \to 1 or \alpha \to 0. The selection of the penalty constant C significantly affects the reliability (type I error) and the precision (power) of the test. The smaller C is, the less penalized the null hypothesis is, leading to a high type I error but enhanced power. In our empirical studies, the optimal value for C is of order \sigma^2 N\log(N), where \sigma^2 is the variance of the error, which is assumed known, and N is the sample size of the data; using the notation in equation (2), N = M_1 M_2 \cdots M_d. The intuition behind this penalty constant is as follows.

Assume that we know the true parameters A, \alpha, \omega_1, \omega_2 and replace them in the least squares sum in (3). The result is a sum of N squared normal errors. We guide our selection of C using the following property of a sequence of normal random variables:

P\left( \sup_{i=1,\dots,n} |X_i| > \sigma\sqrt{2\log(n)} \right) \to 1 \quad \text{as } n \to \infty.

Although this result does not provide a limiting condition for \sum_{i=1}^{N} X_i^2, it can be used to assess its magnitude, which is in the range of 2N\sigma^2\log(N).

An alternative test statistic that can be used to detect mixtures in the regression function is the modified likelihood ratio test (MLRT), in the spirit of Chen et al. (2001). The MLRT can be viewed as a limiting case of the EM-test statistic when the iteration number K goes to \infty. Theorem 1 below indicates that a finite number of iterations will only change the parameter estimates by o_p(1). Therefore, Theorem 1 guarantees, under a finite number of iterations, that the estimators of A, \omega_1 and \omega_2 are in a small neighborhood of the true parameter values; therefore, the quadratic approximation of the EM-test becomes valid. On the other hand, based on the same result, we cannot infer how the test statistic behaves when K \to \infty; an infinite number of iterations may lead to intractable theoretical results.

2.2 Asymptotic Results

We first introduce the notation used in the asymptotic results in this section. Define

U_{i_1,\dots,i_d} = \frac{1}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{0s})^2\tau_s^{-2}+1\}},

V_{i_1,\dots,i_d;s} = \frac{2(x_{i_s}-\omega_{0s})/\tau_s}{(x_{i_s}-\omega_{0s})^2\tau_s^{-2}+1}\,U_{i_1,\dots,i_d},

W_{i_1,\dots,i_d;s} = \frac{3(x_{i_s}-\omega_{0s})^2\tau_s^{-2}-1}{\{(x_{i_s}-\omega_{0s})^2\tau_s^{-2}+1\}^2}\,U_{i_1,\dots,i_d},

T_{i_1,\dots,i_d;st} = \frac{4(x_{i_s}-\omega_{0s})(x_{i_t}-\omega_{0t})/(\tau_s\tau_t)}{\{(x_{i_s}-\omega_{0s})^2\tau_s^{-2}+1\}\{(x_{i_t}-\omega_{0t})^2\tau_t^{-2}+1\}}\,U_{i_1,\dots,i_d},

where \omega_0 = (\omega_{01},\dots,\omega_{0d})^T is the location of the regression component under the null hypothesis.

The notation above is based on the assumption that the shape function takes the form of the Lorentzian function. For other shape functions, U_{i_1,\dots,i_d} is the regression function under the null hypothesis, the V_{i_1,\dots,i_d;s} are the first derivatives of the shape function, and the W_{i_1,\dots,i_d;s} and T_{i_1,\dots,i_d;st} are the second derivatives of the shape function. Further denote the vectors

a_{i_1,\dots,i_d} = (U_{i_1,\dots,i_d}, V_{i_1,\dots,i_d;1},\dots, V_{i_1,\dots,i_d;d})^T,
b_{i_1,\dots,i_d} = (W_{i_1,\dots,i_d;1},\dots, W_{i_1,\dots,i_d;d}, T_{i_1,\dots,i_d;12},\dots, T_{i_1,\dots,i_d;(d-1)d})^T,
c_{i_1,\dots,i_d} = (a_{i_1,\dots,i_d}^T, b_{i_1,\dots,i_d}^T)^T.

Using the notation above, we define the matrix

B = \sum_{i_1=1}^{M_1}\cdots\sum_{i_d=1}^{M_d} c_{i_1,\dots,i_d} c_{i_1,\dots,i_d}^T, \qquad (6)

which can be further decomposed into

B_{11} = \sum a_{i_1,\dots,i_d} a_{i_1,\dots,i_d}^T, \quad B_{21} = B_{12}^T = \sum b_{i_1,\dots,i_d} a_{i_1,\dots,i_d}^T, \quad B_{22} = \sum b_{i_1,\dots,i_d} b_{i_1,\dots,i_d}^T,

where the sums run over i_1 = 1,\dots,M_1, \dots, i_d = 1,\dots,M_d. Further, we denote \tilde{B}_{22} = B_{22} - B_{21} B_{11}^{-1} B_{21}^T.

The asymptotic properties depend on the following three conditions.

C1. The penalty function p(\alpha) is a continuous function such that it is maximized at 0.5 and goes to negative infinity as \alpha goes to 0 or 1.

C2. The matrix N^{-1}B converges to some positive definite matrix as N = \prod_s M_s \to \infty.

C3. There exists a W > 0 such that \omega_{s1}, \omega_{s2} \in [-W, W] for s = 1,\dots,d.

In fixed design nonlinear regression models, one regularity condition is that the matrix N^{-1}B_{11} converges to a positive definite matrix. Assumption C2 is stronger than this assumption.
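The matrices defined above can be assembled directly for the Lorentzian shape. The following sketch follows the definitions literally (including their scaling); the function name and return convention are ours.

```python
import numpy as np

def information_blocks(x, omega0, tau):
    # Assemble B of (6) and the projected block B22_tilde for the Lorentzian
    # shape. x: (n, d) design points; omega0, tau: (d,) arrays.
    n, d = x.shape
    y = (x - omega0) / tau                  # standardized deviations
    q = y ** 2 + 1.0                        # per-coordinate Lorentzian factor
    U = 1.0 / q.prod(axis=1)
    V = (2.0 * y / tau) / q * U[:, None]
    W = (3.0 * y ** 2 - 1.0) / q ** 2 * U[:, None]
    T = [4.0 * y[:, s] * y[:, t] / (tau[s] * tau[t]) / (q[:, s] * q[:, t]) * U
         for s in range(d) for t in range(s + 1, d)]
    a = np.column_stack([U, V])             # rows are a_{i_1,...,i_d}
    b = np.column_stack([W] + T)            # rows are b_{i_1,...,i_d}
    B11, B12, B22 = a.T @ a, a.T @ b, b.T @ b
    B22_tilde = B22 - B12.T @ np.linalg.solve(B11, B12)
    return B11, B22, B22_tilde
```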

It is related to the strong identifiability condition proposed in Chen (1995). This condition ensures that the estimates of \omega_1 and \omega_2 have the optimal convergence rate. It can be verified that if the design points x_{i_s}, i = 1,\dots,M_s, are evenly spaced over [0,1], then Assumption C2 is satisfied.

Theorem 1 ensures that under the null hypothesis the penalized likelihood approach provides consistent parameter estimators.

Theorem 1. Assume conditions C1, C2 and C3 are satisfied. Under the null hypothesis, we have, for j = 1,\dots,J and any fixed finite K,

p(\alpha_j^{(K)}) = O_p(1), \quad A_j^{(K)} - A_0 = O_p(N^{-1/2}), \quad \omega_{j,1}^{(K)} - \omega_0 = O_p(N^{-1/4}), \quad \omega_{j,2}^{(K)} - \omega_0 = O_p(N^{-1/4}).

Based on this consistency result, we derive the following quadratic approximation of the EM-test statistic. Because the asymptotic distribution of the penalized likelihood ratio test does not have a closed-form expression, we use this approximation to obtain the null quantiles.

Theorem 2. Assume conditions C1, C2 and C3 are satisfied and \alpha_1 = 0.5. Let w be a multivariate normal random vector with mean vector 0 and covariance matrix \tilde{B}_{22}. Under the null hypothesis, we have, for any fixed finite K, as N \to \infty,

EM^{(K)} = \sup_{\theta} \{2\tilde{\theta}^T w - \tilde{\theta}^T \tilde{B}_{22} \tilde{\theta}\} + o_p(1),

with \tilde{\theta} = (\theta_1^2,\dots,\theta_d^2, \theta_1\theta_2,\dots,\theta_{d-1}\theta_d)^T and (\theta_1,\dots,\theta_d) \in R^d.

The distribution of the quadratic approximation is difficult to derive analytically. To overcome this difficulty, we suggest the following algorithm to simulate the null distribution quantiles.

1. Generate S random vectors \{w^{(s)},\ s = 1,\dots,S\} from the multivariate normal distribution with mean 0 and covariance matrix \tilde{B}_{22}.

2. For each vector w^{(s)}, calculate Q_s = \sup_{\theta}\{2\tilde{\theta}^T w^{(s)} - \tilde{\theta}^T \tilde{B}_{22} \tilde{\theta}\}. Compute the quantiles of Q_1,\dots,Q_S and use them to approximate the quantiles of EM^{(K)}.

The proofs of Theorem 1 and Theorem 2 are in the Appendix of this paper.

Remark 1. Our asymptotic results apply to penalties such that p(0.5) = 0, as long as C > 0. If the penalty constant is selected such that C_N \to \infty as N \to \infty, for example the suggested penalty constant C_N = \sigma^2 N\log(N), the penalty term converges to 0 in probability and does not contribute to the EM-test statistic. This will not change the asymptotic properties of the EM-test statistic.

Remark 2. Since our testing problem in (2) is not a homogeneity test in a mixture of densities but a homogeneity test in the mean function of a nonlinear regression, the asymptotic results in Li et al. (2009) and Chen and Li (2009) do not apply to our testing problem in general. However, when d = 1, the limiting distribution of EM^{(K)} for testing homogeneity in the mean function of a nonlinear regression is 0.5\chi_0^2 + 0.5\chi_1^2, which is the same as the one for detecting homogeneity in a mixture of densities (Li et al., 2009).
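The two-step simulation algorithm above can be sketched as follows; it can consume the \tilde{B}_{22} assembled in the earlier `information_blocks` sketch. The restart count and the default S are our choices.

```python
import numpy as np
from scipy.optimize import minimize

def v_map(theta):
    # (theta_1,...,theta_d) -> (theta_1^2,...,theta_d^2,
    #  theta_1*theta_2,...,theta_{d-1}*theta_d), as in Theorem 2.
    d = len(theta)
    cross = [theta[s] * theta[t] for s in range(d) for t in range(s + 1, d)]
    return np.concatenate([theta ** 2, np.asarray(cross)])

def null_quantiles(B22_tilde, d, levels=(0.95,), S=1000, restarts=5, seed=0):
    # Simulate Q_s = sup_theta { 2 v(theta)'w - v(theta)' B22_tilde v(theta) }
    # for S draws of w ~ N(0, B22_tilde), then return the requested quantiles.
    rng = np.random.default_rng(seed)
    dim = B22_tilde.shape[0]
    Q = np.empty(S)
    for i in range(S):
        w = rng.multivariate_normal(np.zeros(dim), B22_tilde)
        obj = lambda th: -(2.0 * v_map(th) @ w
                           - v_map(th) @ B22_tilde @ v_map(th))
        # random restarts guard against local optima of the sup
        best = max(-minimize(obj, rng.normal(size=d)).fun
                   for _ in range(restarts))
        Q[i] = max(best, 0.0)   # theta = 0 attains 0, so the sup is >= 0
    return np.quantile(Q, levels)
```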

3 Application of Mixture Test

In real applications of the regression model in (1), the primary goal is to accurately identify the regression components and estimate the model parameters. Existing research in this field commonly applies a local maxima identification method to the multiple-component data (Serban, 2007, 2010, and the references therein), followed by manual intervention for assessing the regression components. In our notation, a local maximum is at location (x_{i_1},\dots,x_{i_d}) if Z_{i_1,\dots,i_d} is larger than its immediate intensity neighbors Z_{i_1\pm 1,\dots,i_d},\dots,Z_{i_1,\dots,i_d\pm 1}. A visual equivalent of a local maximum in one- and two-dimensional data is a peak/spike.

The set of local maxima are initial candidates for the regression components in model (1). A local maximum may correspond to zero (false positive), one or more regression components. We propose using the mixture test introduced in this paper to assess whether a local maximum corresponds to one or more regression components. The general procedure for the application of the mixture test to the more general model (1) takes three steps:

Step 1. We first apply a local maxima identification method. We reduce the complete intensity data to smaller regions, each region corresponding to the surrounding intensity values of one local maximum. This data segmentation step involves an intermediary step at which we apply wavelet-based denoising; local maxima identification is applied to the denoised data to reduce the number of false positives (Serban, 2010). At this intermediary step, we also estimate the noise variance \sigma^2 using the median absolute deviation estimator suggested by Donoho (1995).

Step 2. We fit a model with one regression component (L = 1) to each local maximum region and obtain estimates for \tau. We obtain a common estimate for the width parameter by taking the median over all estimates of \tau; this estimate of \tau relies on the assumption that only a few regression components are mixed. This step also provides initial estimates of the amplitude and the location parameters.

Step 3. Fixing \sigma^2 and \tau at their estimates obtained in the previous two steps, we apply the testing procedure to each region, where the regions could overlap. The size of a region in a d-dimensional design is M^d, where M is equal to the closest integer to the 90% quantile of the Cauchy distribution with spread \min_{s=1,\dots,d}\{\hat{\tau}_s\}. For each region, we decide whether it contains two regression components, i.e., whether to reject the null hypothesis.
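A minimal sketch of the local maxima identification in Step 1, using the immediate-neighbor definition above; the wavelet denoising and the MAD noise estimate used in the paper are omitted, and the threshold is left to the caller.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def local_maxima(Z, threshold):
    # Grid points whose intensity exceeds all immediate neighbors and a
    # noise threshold; returns an array of d-dimensional index vectors.
    peaks = (Z == maximum_filter(Z, size=3)) & (Z > threshold)
    return np.argwhere(peaks)
```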

Remark 3. In Step 3, the values of \sigma^2 and \tau are fixed at the same values for all local maxima. Therefore, the asymptotic results for fixed \sigma^2 are more relevant for this application of the testing procedure.

Remark 4. It is important to note that a number of the discoveries or local maxima are false positives, which arise due to the additive noise; therefore, the mixture test will be applied to true as well as to false positives. In our experiments (simulation and real data), when the mixture test is applied to data consisting of noise only (false positives, or A = 0), we consistently observe convergence problems in Step 1.1 of the EM-test statistic computation. We can use this observation as an indication of the presence of a false positive, which can be used in screening out the false positives.

4 Simulation Study

In this section, we discuss a simulation study for evaluating the reliability (type I error) and efficiency (power) of the peak mixture test. The significance level is 1 - \alpha = 0.95. We compare the testing procedure introduced in this paper for varying values of the penalization constant C and with an additional, simpler testing procedure, called the overdispersion test.

Overdispersion test. The overdispersion test is often used to test one component versus more than one component or mixtures of densities (Neyman and Scott, 1966; Lindsay, 1995). Its extension to our specific testing problem is as follows. Let S(A, \omega) denote the least squares sum:

S(A, \omega) = \sum_{i_1,\dots,i_d} \left( Z_{i_1,\dots,i_d} - \frac{A/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_s)^2\tau_s^{-2}+1\}} \right)^2.

Under the null hypothesis, S(A, \omega) only contains information about random variation, whereas under the alternative hypothesis, S(A, \omega) contains two types of information: random variation and the variation due to the misspecification of the mean function.

Assuming \sigma known, we can use S(A, \omega) to construct a test statistic as follows:

T = \frac{S(\hat{A}_0, \hat{\omega}_0)/\sigma^2 - (N - 1 - d)}{\sqrt{2(N - 1 - d)}}.

Under the null hypothesis, the test statistic is asymptotically normal as N \to \infty. Specifically, if the data is from the one-component model, then T converges to N(0, 1). If the data is from a model consisting of more than one component, then T tends to be large. Therefore, we reject the null hypothesis of one regression component if T is greater than the upper quantile of N(0, 1).

Simulation Setting. We simulate data following the general model in equation (1). In this model, the parameters of the l-th component are the location parameter \omega_l = (\omega_{1l},\dots,\omega_{dl})^T, the width parameter \tau_l = (\tau_{1l},\dots,\tau_{dl})^T, and the amplitude parameter A_l. We simulate data in two (d = 2) and three (d = 3) dimensions. In our simulation study, we assume that the mixture regression shape function s is the Lorentzian function. Therefore, the simulation model is

Z_{i_1,\dots,i_d} = \sum_{l=1}^{L} \frac{A_l/\prod_{s=1}^{d}\tau_{sl}}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{sl})^2\tau_{sl}^{-2}+1\}} + \sigma\epsilon_{i_1,\dots,i_d}. \qquad (7)

The simulation parameters are as follows. The amplitudes A_l, l = 1,\dots,L, vary in the interval [10, 100], and the noise standard error is set to \sigma = 2, 5, 10 or 15. The width parameters are \tau_1 = 0.04 and \tau_2 = 0.06 for the two-dimensional simulation, adding \tau_3 = 0.08 for the three-dimensional simulated data. In this example, the number of Lorentzian components is L = 50, on a 512 x 256 grid of points for d = 2 (i.e., x_{i_1} \in \{1/512,\dots,1\} and x_{i_2} \in \{1/256,\dots,1\}) and on a three-dimensional grid for d = 3. We simulate the error terms \epsilon_{i_1,\dots,i_d} \sim N(0, 1) independently.
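To tie the two pieces above together, here is a sketch that simulates one dataset from model (7) and evaluates the overdispersion statistic T as reconstructed above; the one-component fit (A_hat, omega_hat) is assumed to come from, e.g., Step 0 of the EM-test sketch, and all names are ours.

```python
import numpy as np
from scipy.stats import norm

def simulate_model7(x, A, omega, tau, sigma, rng):
    # One draw from model (7): L Lorentzian components plus Gaussian noise.
    # x: (n, d) design points; A: (L,); omega, tau: (L, d).
    Z = np.zeros(len(x))
    for l in range(len(A)):
        Z += (A[l] / np.prod(tau[l])) / np.prod(
            (x - omega[l]) ** 2 / tau[l] ** 2 + 1.0, axis=1)
    return Z + rng.normal(0.0, sigma, len(x))

def overdispersion_test(Z, x, A_hat, omega_hat, tau, sigma, level=0.05):
    # Overdispersion statistic T; a large T indicates misspecification of
    # the one-component mean, so H0 is rejected above the normal quantile.
    N, d = x.shape
    fit = (A_hat / np.prod(tau)) / np.prod(
        (x - omega_hat) ** 2 / tau ** 2 + 1.0, axis=1)
    S = np.sum((Z - fit) ** 2)
    T = (S / sigma ** 2 - (N - 1 - d)) / np.sqrt(2.0 * (N - 1 - d))
    return T, T > norm.ppf(1.0 - level)
```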

Reliability. To evaluate the reliability of the EM-test introduced in this paper, we simulate from the model in (1) a dataset with L = 50 well-separated regression components; that is, the distance between any two regression components i and j satisfies \|\omega_i - \omega_j\| > 0.1 for i \neq j. We first estimate the width and variance parameters from all 50 regression components, following the first two steps described in Section 3. We apply the EM-test to the region of each regression component, as in Step 3 in Section 3, and compare the EM-test values with the 95% approximate quantile of the null distribution (see Section 2.2 for the procedure used to compute the null quantiles, with S = 1000). The type I error is computed as the proportion of rejections among the 50 regression components. The penalty function used in computing the EM-test statistic is p(\alpha) = C\log(1 - |1 - 2\alpha|) with varying penalizing constants: C = \sigma^2 N\log(N) (optimal), C = N\sigma^2 (medium) and C = \sigma^2\log(N) (small). The initial values for the mixing proportion are \alpha_0 \in \{0.1, 0.2, 0.3, 0.4, 0.5\}.

Figure 2 shows the type I error for the d = 2 and d = 3 simulations. For both d = 2 and d = 3, the type I error is between 0.05 and 0.1 for medium values of \sigma^2 at C = \sigma^2 N\log(N). As the penalizing constant decreases, the test becomes significantly unreliable.

Efficiency. To evaluate the efficiency of the EM-test introduced in this paper, we simulate 25 pairs of regression components, a total of L = 50 components. The distance between the two regression components in a pair satisfies 0.02 < \|\omega_i - \omega_j\| < 0.04. That is, we generate only from the alternative hypothesis, with varying mixing proportions and varying distances between the regression components. We estimate the variance parameter from all 50 regression components but use the true width values. We apply the EM-test to the region of each pair of regression components and compare the EM-test values with the 95% approximate quantile of the null distribution (see Section 2.2 for the procedure used to compute the null quantiles, with S = 1000). The power is computed as the proportion of rejections among the 25 pairs of regression components.

Figure 3 shows the power for the d = 2 and d = 3 simulations. The power is significantly higher for the two-dimensional simulated data; there is only a slight difference across the varying penalizing constants, with a lower power for the optimal C = \sigma^2 N\log(N). In contrast, for the three-dimensional simulated data, the difference in the test power for varying penalizing constants is much larger.

Discussion. By varying the constant C, there is a significant reduction in the type I error. The difference in type I error can be as high as 0.35 for two-dimensional data and 0.25 for three-dimensional data. The difference is less substantial for three-dimensional data because, as the dimensionality increases, there is more information (a larger sample size) to be used in deciding whether there is one component or more. For example, in our implementation, the size of a region surrounding a local maximum is N = 17^2 = 289 for two-dimensional data and N = 17^3 = 4913 for three-dimensional data. These results validate our asymptotic theory. On the other hand, the power is not as sensitive to different values of C as the type I error. The largest difference, for three-dimensional data, is about 0.1. The power also decreases as the signal-to-noise ratio decreases. An explanation for this is that at large noise levels, small amplitude components are non-detectable from the noise level.

When comparing with the simpler approach, the overdispersion test, the power is higher than for the optimal C for both two- and three-dimensional simulated data. However, the type I error is significantly higher for three-dimensional data when comparing across all C values, and only slightly higher than the optimal C for two-dimensional data. The poor performance for the three-dimensional design is because of the poor separation of the distribution of S(\hat{A}_0, \hat{\omega}_0) under the null and alternative hypotheses (see Figure 4). We therefore conclude that although we may gain some power by using the overdispersion test, this gain is offset by the loss in test reliability, especially for higher dimensional data. This is critical for our application, since a large number of false positives in higher dimensional NMR data could lead to significant distortion of the predicted protein structure.

Figure 2: Type I error calculated over 50 regression components generated under the null hypothesis. [Panels (a) 2D Simulation and (b) 3D Simulation: type I error versus noise standard deviation for optimal C, medium C, small C, and the overdispersion test.]

Reliability vs. Efficiency. In conclusion, as the dimensionality and the error variance increase, the test efficiency also decreases. We can enhance the efficiency by decreasing the penalizing constant, but this in turn reduces the reliability of the test. This simulation study clearly illustrates how the penalizing constant controls the trade-off between reliability and efficiency. From this simulation, as well as from other simulations not reported here, we recommend using the constant C = \sigma^2 N\log(N). We have performed other simulation studies with a larger number of initial values for the mixing proportion parameter, as well as with a different penalty function, p(\alpha) = C\log(4\alpha(1-\alpha)). From this extensive simulation study, we therefore conclude that the most important tuning parameter in the application of the EM-test remains the penalizing constant C.

Figure 3: Test power calculated over 50 regression components generated under the alternative hypothesis. [Panels (a) 2D Simulation and (b) 3D Simulation: power versus noise standard deviation for optimal C, medium C, small C, and the overdispersion test.]

5 Case Study

5.1 Motivation and Background

In NMR data analysis for biomolecular studies, one primary objective is to estimate parameters (e.g., chemical shifts) of the atomic nuclei of a protein. Under protein magnetization, targeted atomic nuclei in the protein undergo energy transfers; each energy transfer induces a signal which is mathematically described by a decaying sinusoid (see the Supplemental material). Therefore, the NMR signal generated by a d-dimensional NMR experiment is a sum of noisy decaying sinusoids,

S(t_1, t_2, \dots, t_d) = \sum_{l=1}^{L} A_l e^{i\phi_l} \prod_{s=1}^{d} e^{-t_s/\tau_{sl}} e^{i t_s \omega_{sl}} + \epsilon_{t_1,\dots,t_d}, \qquad (8)

where each sinusoid is generated by an energy transfer between d atomic nuclei in d-dimensional NMR experiments (Hoch and Stern, 1996). The model parameters of interest are the resonance frequencies \omega_l = (\omega_{1l},\dots,\omega_{dl})^T (translated into chemical shifts) and the signal amplitudes A_l (translated into structural distances of the atomic nuclei in specific NMR experiments). Also, L is the number of observed energy transfers, which is large and unknown. The protein structure is resolved by accurately estimating the resonance frequencies and the signal amplitudes from data generated by NMR experiments.

Figure 4: The density estimate of the distribution of S(\hat{A}_0, \hat{\omega}_0) under the null (solid line) and alternative (dotted line) hypotheses. [Panels: (a) \sigma = 2 and (b) \sigma = 15.]
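For intuition, the following sketch simulates the time-domain signal (8) on a d-dimensional grid; applying np.fft.fftn to its output yields approximately Lorentzian peaks at the resonance frequencies. The function name and array layout are our own.

```python
import numpy as np

def fid(t_axes, A, phi, tau, omega, sigma, rng):
    # Simulate (8): a sum of L decaying complex sinusoids plus complex noise.
    # t_axes: list of d 1-d time grids; A, phi: (L,); tau, omega: (L, d).
    mesh = np.meshgrid(*t_axes, indexing="ij")
    S = np.zeros(mesh[0].shape, dtype=complex)
    for l in range(len(A)):
        term = A[l] * np.exp(1j * phi[l]) * np.ones_like(S)
        for s, t in enumerate(mesh):
            term *= np.exp(-t / tau[l, s] + 1j * t * omega[l, s])
        S += term
    return S + rng.normal(0, sigma, S.shape) + 1j * rng.normal(0, sigma, S.shape)
```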

The traditional methodology in biomolecular NMR data analysis involves Fourier transformation (FT) of the NMR signal data, complemented by other pre-processing steps (Hoch and Stern, 1996). After the Fourier transform, the resulting model is a d-dimensional mixture regression model as described by the model in equation (1), where the shape function is an approximate Lorentzian function (Serban, 2005). In this model, the parameters of the l-th regression component are the location parameter \omega_l = (\omega_{1l},\dots,\omega_{dl})^T, i.e., the signal frequencies; the width parameter \tau_l = (\tau_{1l},\dots,\tau_{dl})^T; and the amplitude parameter A_l. Because of the one-to-one mapping between energy transfers and regression components, the problem of identifying the parameters of the atomic nuclei undergoing energy transfers translates into accurately identifying and estimating the model parameters.

Importantly, in multi-dimensional NMR frequency data, since a large number of regression components have similar frequencies, we expect that some components partially overlap or mix. It is important to de-mix the components, since each regression component provides specific information about the structure of the protein. In certain cases, the lack of a small number of essential components can lead to a significant deviation of the predicted structure (Güntert, 2003).

Many of the existing software packages for biomolecular NMR data analysis incorporate routines for component identification, but most of them do not incorporate a routine for detecting mixtures (Güntert, 2003; Gronwald and Kalbitzer, 2004). The common practice for detecting mixed or overlapped components is to visualize the contours of all local maxima and manually select the ones that display significant mixing. However, manual intervention is tedious and time consuming, since L is generally large. The testing procedure in this paper is a knowledge-based means of detecting a small number of candidate overlapped components that can then be investigated visually. This procedure will reduce the time spent manually selecting mixed components by at least tenfold.

Remark 5. In Section 1, we mention that the assumption of equal widths is common in NMR biomolecular studies. The width parameters in the frequency domain map to the decay times \tau_l, l = 1,\dots,L, in the time domain. Since in many biomolecular NMR studies we detect the exchange of energy between the same pair of nuclei (e.g., 15N-proton), the decay times are similar, and therefore the width parameters are similar.

Remark 6. Another assumption for the mixture regression test is that the location parameters \omega_l, l = 1,\dots,L, are bounded (condition C3). For NMR biomolecular data, this assumption holds, since the NMR signals are filtered such that the frequencies are within a specific bandwidth.

5.2 Application Data: 2D and 3D NMR Experiments

Data Specifications. The experimental data used to illustrate the applicability of the mixture regression test are for a doubly-labeled sample of a 130-residue RNA binding protein, rho130, using standard double (HSQC) and triple resonance (HNCOCA and HNCA) experiments on a 1 mM protein sample at a proton frequency of 600 MHz, as introduced in Briercheck et al. (1998). The data were processed with FELIX (Accelrys Software Inc.) using apodization and linear prediction methods that are typical for these types of experiments.

The HSQC spectrum is 2D NMR data in which the regression components show correlations between the amide nitrogen and the attached amide proton in the protein sequence. The intensities Z_{i_1,i_2} are observed over a two-dimensional grid of points with M_1 = 512 and M_2 = 256. The HNCOCA is a three-dimensional experiment generating 3D NMR data in which each component arises due to correlations between the amide nitrogen and amide proton of a specific residue and the alpha carbon of the preceding residue in the protein sequence.

Dataset   No. of Local Maxima   No. of Rejections (EM-test)
HSQC      146                   9
HNCA      235                   0
HNCOCA    118                   0

Table 1: Number of local maxima and the number of mixed regression components for the three application datasets.

HNCA is also a three-dimensional experiment, in which components are paired with similar amide nitrogen and amide proton frequencies. In HNCA, a pair of components arises due to correlations between the amide nitrogen, the amide proton, and the alpha carbon nuclei of the preceding residue and of the intra-residue. Therefore, for this experiment, the true number of regression components will be about twice the number of protein residues, and half of the components will match the components in HNCOCA. For both the HNCOCA and HNCA NMR datasets, the intensity values Z_{i_1,i_2,i_3} are observed over a three-dimensional grid of points.

Mixture Test Application. Using the local maxima identification method briefly described in Section 3 and introduced in Serban (2007, 2010), we identified 146 local maxima for the HSQC experiment, 235 for the HNCA experiment and 118 for the HNCOCA experiment. We applied the EM-test to the two-dimensional and three-dimensional data generated by these three NMR experiments for the RNA binding protein rho130. For computational simplicity and effectiveness, the mixture test was applied only to data in the neighborhood of a local maximum, as specified in Section 3 (Step 3). The width parameters \tau_s, s = 1,\dots,d, were estimated by estimating the widths of all local maxima and then taking the median over all estimates. The number of rejections (mixed regression components), along with the number of local maxima identified using the method in Serban (2007, 2010), are provided in Table 1.

In this study, because the error variance is unknown, the estimated variance is the median absolute value of the high resolution wavelet coefficients after wavelet transformation of each of the three datasets (Serban, 2010). The regression component widths are estimated by first interpolating a Lorentzian function at each local maximum; the common width estimate is then the median over the widths estimated from this interpolation step.

The contour plots for the nine rejected regression components for the HSQC data are shown in Figure 5. A visual inspection of the contour plots further suggests which of the nine rejections are true positives. The contour plots 1-3 and 9 are asymmetric and larger in size, a pattern that reveals a mixture of two components overlapped into one local maximum. The five contour plots 4-8 are symmetric, all with similar widths in both directions. In the supplemental material, we complement the contour plots with perspective plots which show the shape and the width of the nine rejections.

We complemented the analysis of the two-dimensional data with manual intervention: visual assessment of each of the 146 local maxima. Through this experiment, we did not identify more than the four mixtures discussed above, which shows that the application of the mixture test minimizes the effort of detecting overlapped or mixed components by reducing the number of screened local maxima from 146 to only 9.

For the two three-dimensional datasets, we do not reject the null hypothesis for any local maximum. The reason for not detecting any mixture of regression components is that the protein in this study has a small number of residues, leading to a small number of components (L \approx 260 for HNCA and L \approx 130 for HNCOCA) which are spread over three-dimensional data, leading to higher resolution. Generally, one advantage of higher dimensional NMR data is an increase in resolution, measured by the distance between components, which results in a smaller number of mixed components than in lower dimensional data.

Figure 5: The contour plots for the nine rejections from the application of the EM-test to the HSQC data. [Panels numbered (1)-(9).]

On the other hand, the drawback of higher dimensional data is a lower signal-to-noise ratio, that is, a smaller number of identifiable regression components from the noisy background; in our application, we underestimate the number of components for the two three-dimensional datasets. Therefore, complementing the NMR data analysis with a mixture test will increase the ability to identify regression components with lower dimensional NMR experiments, which not only have a higher signal-to-noise ratio but are also less time-consuming and less expensive.

Appendix A: Existence of a Solution to g(\alpha)

An updated value of \alpha, given that A = A_j^{(k)}, \omega_1 = \omega_{j,1}^{(k)} and \omega_2 = \omega_{j,2}^{(k)} are fixed, is derived from maximizing the penalized log-likelihood function with respect to \alpha. This is equivalent to minimizing g(\alpha) in equation (5) in the main manuscript, with

C_1 = \frac{1}{2\sigma^2}\sum \left( \frac{A_j^{(k)}/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{j,s1}^{(k)})^2\tau_s^{-2}+1\}} - \frac{A_j^{(k)}/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{j,s2}^{(k)})^2\tau_s^{-2}+1\}} \right)^2,

C_2 = \frac{1}{2\sigma^2}\sum \left( Z_{i_1,\dots,i_d} - \frac{A_j^{(k)}/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{j,s2}^{(k)})^2\tau_s^{-2}+1\}} \right) \left( \frac{A_j^{(k)}/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{j,s1}^{(k)})^2\tau_s^{-2}+1\}} - \frac{A_j^{(k)}/\prod_{s=1}^{d}\tau_s}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_{j,s2}^{(k)})^2\tau_s^{-2}+1\}} \right).

Penalty p(\alpha) = C\log(4\alpha(1-\alpha)). To obtain an updated value of \alpha by minimizing g(\alpha), we take the first order derivative of g(\alpha) and equate it to zero. When using the penalty p(\alpha) = C\log(4\alpha(1-\alpha)), the first order derivative is

g'(\alpha) = 2C_1\alpha - 2C_2 - \frac{C(1-2\alpha)}{\alpha(1-\alpha)}.

Setting g'(\alpha) = 0, there exists a real solution in the interval (0, 1) for C > 0, since \lim_{\alpha\to 0} g'(\alpha) = -\infty and \lim_{\alpha\to 1} g'(\alpha) = +\infty. Moreover, the second order derivative is

g''(\alpha) = 2C_1 + C\left( \frac{2}{\alpha(1-\alpha)} + \frac{(1-2\alpha)^2}{\alpha^2(1-\alpha)^2} \right),

which is positive for all \alpha \in [0, 1] since C_1 \geq 0. This implies that g'(\alpha) = 0 has a unique solution in [0, 1], and this solution is the minimum point of g(\alpha).
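The root whose existence and uniqueness is established above has no simple closed form, but it is easy to compute numerically; a sketch (the bracketing interval endpoints are our choice):

```python
from scipy.optimize import brentq

def update_alpha_smooth(C1, C2, C, eps=1e-10):
    # Unique root of g'(a) = 2*C1*a - 2*C2 - C*(1-2*a)/(a*(1-a)) in (0, 1);
    # g' -> -inf as a -> 0+ and g' -> +inf as a -> 1-, so brentq brackets it.
    gprime = lambda a: 2*C1*a - 2*C2 - C * (1 - 2*a) / (a * (1 - a))
    return brentq(gprime, eps, 1 - eps)
```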

Penalty p(\alpha) = C\log(1 - |1 - 2\alpha|). Let \alpha^* = C_2/C_1, which is the minimum point of C_1\alpha^2 - 2C_2\alpha. If \alpha^* < 0.5, then the minimum point of g(\alpha) should be in (0, 0.5]. Note that when \alpha \in (0, 0.5],

g(\alpha) = C_1\alpha^2 - 2C_2\alpha - C\log(2\alpha) \quad \text{and} \quad g'(\alpha) = 2C_1\alpha - 2C_2 - C/\alpha.

Setting g'(\alpha) = 0, we get a unique positive solution,

\hat{\alpha} = \frac{2C_2 + \sqrt{4C_2^2 + 8C_1 C}}{4C_1}.

The second derivative over (0, 0.5] is always positive. Therefore, for the penalty p(\alpha) = C\log(1 - |1 - 2\alpha|), the updated value of \alpha is

\alpha^{(k+1)} = \min\left\{0.5,\ \frac{2C_2 + \sqrt{4C_2^2 + 8C_1 C}}{4C_1}\right\}.

Similarly, when \alpha^* \geq 0.5, the updated value of \alpha is

\alpha^{(k+1)} = \max\left\{0.5,\ \frac{2C_1 + 2C_2 - \sqrt{4(C_1 - C_2)^2 + 8C_1 C}}{4C_1}\right\}.

In summary,

\alpha^{(k+1)} = \begin{cases} \min\left\{0.5,\ \dfrac{2C_2 + \sqrt{4C_2^2 + 8C_1 C}}{4C_1}\right\}, & \text{if } C_2/C_1 < 0.5, \\[2ex] \max\left\{0.5,\ \dfrac{2C_1 + 2C_2 - \sqrt{4(C_1 - C_2)^2 + 8C_1 C}}{4C_1}\right\}, & \text{if } C_2/C_1 \geq 0.5. \end{cases}

Appendix B: Proofs of Theorems 1 and 2

Without loss of generality, we assume that \omega_{0s} = 0, \tau_s = 1 and \sigma = 1. Let

\epsilon_{i_1,\dots,i_d} = Z_{i_1,\dots,i_d} - \frac{A_0}{\prod_{s=1}^{d}(x_{i_s}^2 + 1)} \sim N(0, 1).

In the following, we use \sum to denote \sum_{i_1=1}^{M_1}\cdots\sum_{i_d=1}^{M_d}. A brief roadmap for the proofs is as follows. Lemma 1 shows that any estimator with \alpha bounded away from 0 or 1, and with a large likelihood value, is consistent for A, \omega_1 and \omega_2 under the null hypothesis. Lemma 2 strengthens Lemma 1 by providing specific convergence rates. Theorems 1 and 2 then follow directly from these two lemmas.

Lemma 1. Assume the same conditions as in Theorem 1. Let (\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) be estimators of (\alpha, A, \omega_1, \omega_2) such that \delta \leq \bar{\alpha} \leq 0.5 for some \delta \in (0, 0.5]. Assume that

l(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) - l(0.5, A_0, \omega_0, \omega_0) \geq -c > -\infty.

Then, under the null hypothesis, \bar{A} - A_0 = o_p(1), \bar{\omega}_1 = o_p(1) and \bar{\omega}_2 = o_p(1).

Proof. Under the assumptions \alpha \in [\delta, 0.5] and \omega_{s1}, \omega_{s2} \in [-W, W], the mixture regression model reduces to the null model only when A = A_0 and \omega_1 = \omega_2 = 0. That is, the parameters A, \omega_1 and \omega_2 are identifiable. Then the idea of Wald (1949) can be applied to show that \bar{A} - A_0 = o_p(1), \bar{\omega}_1 = o_p(1) and \bar{\omega}_2 = o_p(1).

For the convenience of presentation, let

\epsilon_{i_1,\dots,i_d} = Z_{i_1,\dots,i_d} - \frac{A_0}{\prod_{s=1}^{d}(x_{i_s}^2 + 1)},

which has a N(0, 1) distribution under the null hypothesis, and

h_{i_1,\dots,i_d}(A, \omega) = \frac{A}{\prod_{s=1}^{d}\{(x_{i_s}-\omega_s)^2 + 1\}} - \frac{A_0}{\prod_{s=1}^{d}(x_{i_s}^2 + 1)}.

Lemma 2. Assume the same conditions as in Theorem 1. Let (\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) be estimators of (\alpha, A, \omega_1, \omega_2) such that, under the null hypothesis, \bar{A} - A_0 = o_p(1), \bar{\omega}_1 = o_p(1), \bar{\omega}_2 = o_p(1). Assume that

pl(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) - pl(0.5, A_0, \omega_0, \omega_0) \geq -c > -\infty.

Then, under the null hypothesis,

p(\bar{\alpha}) = O_p(1), \quad \bar{A} - A_0 = O_p(N^{-1/2}), \quad \bar{\omega}_1 = O_p(N^{-1/4}), \quad \bar{\omega}_2 = O_p(N^{-1/4}).

Proof. Let R_1(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) = 2\{l(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) - l(0.5, A_0, \omega_0, \omega_0)\}. Note that the penalty function is non-positive. It follows that

-2c \leq 2\{pl(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) - pl(0.5, A_0, \omega_0, \omega_0)\} \leq R_1(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2).

With the notation h_{i_1,\dots,i_d}(\cdot, \cdot), we can write R_1(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) in the following form:

R_1(\bar{\alpha}, \bar{A}, \bar{\omega}_1, \bar{\omega}_2) = 2\sum \epsilon_{i_1,\dots,i_d}\{\bar{\alpha} h_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_1) + (1-\bar{\alpha}) h_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_2)\} - \sum \{\bar{\alpha} h_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_1) + (1-\bar{\alpha}) h_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_2)\}^2.

Applying a second-order Taylor expansion to 1/\prod_{s=1}^{d}\{(x_{i_s}-\omega_{s1})^2 + 1\}, we get

h_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_1) = (\bar{A} - A_0) U_{i_1,\dots,i_d} + \sum_{s=1}^{d} \bar{A}\bar{\omega}_{s1} V_{i_1,\dots,i_d;s} + \sum_{s=1}^{d} \bar{A}\bar{\omega}_{s1}^2 W_{i_1,\dots,i_d;s} + \sum_{s<t} \bar{A}\bar{\omega}_{s1}\bar{\omega}_{t1} T_{i_1,\dots,i_d;st} + e_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_1),

where e_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_1) is the remainder. A similar approximation can also be obtained for h_{i_1,\dots,i_d}(\bar{A}, \bar{\omega}_2).


More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series

The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series Willa W. Chen Rohit S. Deo July 6, 009 Abstract. The restricted likelihood ratio test, RLRT, for the autoregressive coefficient

More information

Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon

Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon Jianqing Fan Department of Statistics Chinese University of Hong Kong AND Department of Statistics

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Some General Types of Tests

Some General Types of Tests Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

TESTS FOR HOMOGENEITY IN NORMAL MIXTURES IN THE PRESENCE OF A STRUCTURAL PARAMETER: TECHNICAL DETAILS

TESTS FOR HOMOGENEITY IN NORMAL MIXTURES IN THE PRESENCE OF A STRUCTURAL PARAMETER: TECHNICAL DETAILS TESTS FOR HOMOGENEITY IN NORMAL MIXTURES IN THE PRESENCE OF A STRUCTURAL PARAMETER: TECHNICAL DETAILS By Hanfeng Chen and Jiahua Chen 1 Bowling Green State University and University of Waterloo Abstract.

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University The Bootstrap: Theory and Applications Biing-Shen Kuo National Chengchi University Motivation: Poor Asymptotic Approximation Most of statistical inference relies on asymptotic theory. Motivation: Poor

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Ch. 5 Hypothesis Testing

Ch. 5 Hypothesis Testing Ch. 5 Hypothesis Testing The current framework of hypothesis testing is largely due to the work of Neyman and Pearson in the late 1920s, early 30s, complementing Fisher s work on estimation. As in estimation,

More information

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science 1 Likelihood Ratio Tests that Certain Variance Components Are Zero Ciprian M. Crainiceanu Department of Statistical Science www.people.cornell.edu/pages/cmc59 Work done jointly with David Ruppert, School

More information

Adjusted Empirical Likelihood for Long-memory Time Series Models

Adjusted Empirical Likelihood for Long-memory Time Series Models Adjusted Empirical Likelihood for Long-memory Time Series Models arxiv:1604.06170v1 [stat.me] 21 Apr 2016 Ramadha D. Piyadi Gamage, Wei Ning and Arjun K. Gupta Department of Mathematics and Statistics

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

Bayesian Nonparametric Point Estimation Under a Conjugate Prior

Bayesian Nonparametric Point Estimation Under a Conjugate Prior University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 5-15-2002 Bayesian Nonparametric Point Estimation Under a Conjugate Prior Xuefeng Li University of Pennsylvania Linda

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Nonparametric Modal Regression

Nonparametric Modal Regression Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Defect Detection using Nonparametric Regression

Defect Detection using Nonparametric Regression Defect Detection using Nonparametric Regression Siana Halim Industrial Engineering Department-Petra Christian University Siwalankerto 121-131 Surabaya- Indonesia halim@petra.ac.id Abstract: To compare

More information

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure) Previous lecture Single variant association Use genome-wide SNPs to account for confounding (population substructure) Estimation of effect size and winner s curse Meta-Analysis Today s outline P-value

More information

OPTIMAL SURE PARAMETERS FOR SIGMOIDAL WAVELET SHRINKAGE

OPTIMAL SURE PARAMETERS FOR SIGMOIDAL WAVELET SHRINKAGE 17th European Signal Processing Conference (EUSIPCO 009) Glasgow, Scotland, August 4-8, 009 OPTIMAL SURE PARAMETERS FOR SIGMOIDAL WAVELET SHRINKAGE Abdourrahmane M. Atto 1, Dominique Pastor, Gregoire Mercier

More information

Mixture of Gaussians Models

Mixture of Gaussians Models Mixture of Gaussians Models Outline Inference, Learning, and Maximum Likelihood Why Mixtures? Why Gaussians? Building up to the Mixture of Gaussians Single Gaussians Fully-Observed Mixtures Hidden Mixtures

More information

2.1.3 The Testing Problem and Neave s Step Method

2.1.3 The Testing Problem and Neave s Step Method we can guarantee (1) that the (unknown) true parameter vector θ t Θ is an interior point of Θ, and (2) that ρ θt (R) > 0 for any R 2 Q. These are two of Birch s regularity conditions that were critical

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Hypothesis Testing - Frequentist

Hypothesis Testing - Frequentist Frequentist Hypothesis Testing - Frequentist Compare two hypotheses to see which one better explains the data. Or, alternatively, what is the best way to separate events into two classes, those originating

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Estimating a pseudounitary operator for velocity-stack inversion

Estimating a pseudounitary operator for velocity-stack inversion Stanford Exploration Project, Report 82, May 11, 2001, pages 1 77 Estimating a pseudounitary operator for velocity-stack inversion David E. Lumley 1 ABSTRACT I estimate a pseudounitary operator for enhancing

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

An Introduction to Wavelets and some Applications

An Introduction to Wavelets and some Applications An Introduction to Wavelets and some Applications Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France An Introduction to Wavelets and some Applications p.1/54

More information

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data Song Xi CHEN Guanghua School of Management and Center for Statistical Science, Peking University Department

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics )

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics ) Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics ) James Martens University of Toronto June 24, 2010 Computer Science UNIVERSITY OF TORONTO James Martens (U of T) Learning

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary ECE 830 Spring 207 Instructor: R. Willett Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we saw that the likelihood

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Testing Algebraic Hypotheses

Testing Algebraic Hypotheses Testing Algebraic Hypotheses Mathias Drton Department of Statistics University of Chicago 1 / 18 Example: Factor analysis Multivariate normal model based on conditional independence given hidden variable:

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Solutions for Examination Categorical Data Analysis, March 21, 2013

Solutions for Examination Categorical Data Analysis, March 21, 2013 STOCKHOLMS UNIVERSITET MATEMATISKA INSTITUTIONEN Avd. Matematisk statistik, Frank Miller MT 5006 LÖSNINGAR 21 mars 2013 Solutions for Examination Categorical Data Analysis, March 21, 2013 Problem 1 a.

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction NUCLEAR NORM PENALIZED ESTIMATION OF IERACTIVE FIXED EFFECT MODELS HYUNGSIK ROGER MOON AND MARTIN WEIDNER Incomplete and Work in Progress. Introduction Interactive fixed effects panel regression models

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

EM Algorithms for Ordered Probit Models with Endogenous Regressors

EM Algorithms for Ordered Probit Models with Endogenous Regressors EM Algorithms for Ordered Probit Models with Endogenous Regressors Hiroyuki Kawakatsu Business School Dublin City University Dublin 9, Ireland hiroyuki.kawakatsu@dcu.ie Ann G. Largey Business School Dublin

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Testing Homogeneity Of A Large Data Set By Bootstrapping

Testing Homogeneity Of A Large Data Set By Bootstrapping Testing Homogeneity Of A Large Data Set By Bootstrapping 1 Morimune, K and 2 Hoshino, Y 1 Graduate School of Economics, Kyoto University Yoshida Honcho Sakyo Kyoto 606-8501, Japan. E-Mail: morimune@econ.kyoto-u.ac.jp

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

IMPROVEMENTS IN MODAL PARAMETER EXTRACTION THROUGH POST-PROCESSING FREQUENCY RESPONSE FUNCTION ESTIMATES

IMPROVEMENTS IN MODAL PARAMETER EXTRACTION THROUGH POST-PROCESSING FREQUENCY RESPONSE FUNCTION ESTIMATES IMPROVEMENTS IN MODAL PARAMETER EXTRACTION THROUGH POST-PROCESSING FREQUENCY RESPONSE FUNCTION ESTIMATES Bere M. Gur Prof. Christopher Niezreci Prof. Peter Avitabile Structural Dynamics and Acoustic Systems

More information