Spiking problem in monotone regression: penalized residual sum of squares

Jayanta Kumar Pal [1][2]

SAMSI, NC 27606, U.S.A.

Abstract

We consider the estimation of a monotone regression at its end-point, where the least squares solution is inconsistent. We penalize the least squares criterion to achieve consistency. We also derive the limit distribution for the residual sum of squares, which can be used to construct confidence intervals.

Keywords & phrases: monotone regression, penalization, residual sum of squares, spiking problem.

1 Introduction

Monotone regression has gained statistical importance in the last decades, in several scientific problems; e.g. dose-response experiments in biology, modeling disease incidence as a function of toxicity levels, and industrial experiments such as the effect of temperature on the strength of steel. The least squares estimator (LSE henceforth) for this problem is described by Robertson et al. (1987). We refer to Wright (1981) for its derivation and Groeneboom and Wellner (2001) for its distribution.

In many of these problems, including the ones mentioned above, the value of the regression function at one of the end-points of the domain of the covariate is of special interest. For example,

[1] 19, T.W. Alexander Drive, Research Triangle Park, NC 27709, U.S.A., jpal@samsi.info
[2] Research supported by NSF-SAMSI cooperative agreement No. DMS-0112069
the response level in the absence of any dose of medicine, or the disease incidence at zero toxicity level (known as the base level or placebo), is a quantity that needs specific attention. However, the LSE fails to be consistent at the end-points and has a systematic non-negligible asymptotic bias. This is known as the spiking problem.

A similar problem also arises in the estimation of decreasing densities, where the maximum likelihood estimator is inconsistent at its left end-point. This has been addressed by Woodroofe and Sun (1993), using a penalization on the likelihood. Recently, Pal (2006, Chapter 5) investigated the behavior of the likelihood ratio in the presence of penalization in the density estimation problem. Analogously, here we apply the penalization to the least squares criterion and investigate the estimator at the end-point and the corresponding residual sum of squares.

In Section 2 we formulate the penalized least squares problem and characterize the estimators. We also state the main result regarding the convergence of the penalized residual sum of squares to a limit distribution. In Section 3 we present the proofs of the key propositions.

2 Main results

2.1 Preliminaries

We consider the problem of estimating the value of an increasing regression function at its left end-point. The problem of decreasing functions, or estimation at the right end-point, follows similarly, by reflecting the regression on the vertical or horizontal axes. For simplicity, we restrict ourselves to the unit interval as the support of the covariate, and take the design points as equally spaced, viz. $x_i = i/n$. Suppose we have observations $Y_1, Y_2, \ldots, Y_n$ at those design points such that
\[ Y_i = \mu(x_i) + \epsilon_i \quad \text{for } i = 1, \ldots, n, \]
where the $\epsilon_i$ are independent, identically distributed errors with zero mean and common variance $\sigma^2$. Assuming that the regression function $\mu$ is increasing on $[0,1]$, the least squares estimate (LSE) of $\mu$ (the MLE in case the $\epsilon_i$ are Gaussian), which minimizes $\sum_{i=1}^n (Y_i - \mu(x_i))^2$,
is the Pool-Adjacent-Violators-Algorithm (PAVA henceforth) estimator
\[ \tilde\mu(x_k) = \min_{s \ge k} \max_{r \le k} \frac{Y_r + \cdots + Y_s}{s - r + 1}. \]
Define $\mu_i = \mu(x_i)$ for $i \ge 1$, and $\tilde\mu_i = \tilde\mu(x_i)$. Then $\tilde\mu(0+) = \tilde\mu(x_1) = \min_{r \ge 1} (Y_1 + \cdots + Y_r)/r$ is inconsistent for $\mu(0+)$. In particular, $P(\tilde\mu(0+) \le \mu(0+)) \to 1$, always yielding a negative non-negligible bias.

2.2 Penalization on the least squares

One way to counter the problem is to penalize too low an estimate of $\mu(0+)$. In particular, we look at the penalized least squares criterion
\[ \sum_{i=1}^n (Y_i - \mu_i)^2 - 2n\alpha\mu_1 \tag{1} \]
where $\alpha = \alpha_n$ is a penalty parameter that depends on the sample size $n$. For simplicity, we make that dependence implicit.

To start with, we define the cusum diagram $F$ of the data as follows. Let $F_r = (Y_1 + \cdots + Y_r)/n$ for $r = 1, \ldots, n$. $F$ is the piecewise linear interpolant of $F_1, \ldots, F_n$ on the interval $[0,1]$ such that $F(i/n) = F_i$. The following characterization of the penalized least squares estimator, denoted $\hat\mu$, is similar to the estimator of Woodroofe and Sun (1993). The proof is presented in Section 3.

Proposition 1 $\hat\mu$, the minimizer of (1), can be characterized as
\[ \hat\mu(x_k) = \begin{cases} \min_{0 < b \le 1} \dfrac{F(b)+\alpha}{b}, & \text{if } k \le m_0 \\[4pt] \tilde\mu(x_k), & \text{if } k > m_0 \end{cases} \]
where $m_0 = n \,\mathrm{argmin}_{0 < b \le 1}\, (F(b)+\alpha)/b$. In particular, $\hat\mu(0+) = \min_r \dfrac{F_r + \alpha}{r/n}$.

The goal is to construct confidence sets for $\mu(0+)$, and for that we need sampling distributions for the estimate $\hat\mu(0+)$. The estimator achieves consistency, and its limit distribution is derived subsequently. Alternatively, we can use the residual sum of squares. However, the regular RSS is not relevant here because of the penalization. Under the constraint $\mu(0+) = c$, a specific value, the penalized sum of squares will increase and reflect the effect of $\mu(0+)$. Hence, we focus on the
difference between the constrained and unconstrained penalized sums of squares, and conclude that a large difference indicates a lack of evidence in favor of the hypothesis $\mu(0+) = c$. To be more precise, we measure
\[ Q = \Big[ \min_{\mu,\, \mu_1 = c} \sum_{i=1}^n (Y_i - \mu_i)^2 - 2n\alpha c \Big] - \Big[ \min_{\mu} \sum_{i=1}^n (Y_i - \mu_i)^2 - 2n\alpha\mu_1 \Big] \]
as a test statistic for that hypothesis.

To find the constrained (penalized) least squares solution, one would try to minimize the quantity $\sum_{i=1}^n (Y_i - \mu_i)^2$ under the constraint $c = \mu_1 \le \mu_2 \le \cdots \le \mu_n$. Taking $\beta$ as a Lagrange multiplier, we minimize
\[ \sum_{i=1}^n (Y_i - \mu_i)^2 - 2n\beta\mu_1 \tag{2} \]
subject to $\mu_1 = c$. The following characterization, similar to that of the penalized LSE, can be proved using similar arguments. See Section 3 for a proof.

Lemma 1 $\hat\mu^c$, the minimizer of (2), can be characterized as
\[ \hat\mu^c(x_k) = \begin{cases} c, & \text{if } k \le s_0 \\ \tilde\mu(x_k), & \text{if } k > s_0 \end{cases} \]
where $s_0 = n\,\mathrm{argmin}_{0 \le b \le 1}\, (F(b) - cb)$. For convenience, we define $b_{s_0} = s_0/n$ and $b_{m_0} = m_0/n$.

2.3 Limit distribution for the estimators

We state the main result related to the asymptotic behavior of the LSE and the RSS (both penalized) in the following theorem. First, we need a few definitions. Define $G(t) = B(t) + t^2$ and $G_1(t) = (G(t) + 1)/t$, where $B$ is a standard Brownian motion, and let $Z(t)$ be the right-derivative process of the Greatest Convex Minorant (GCM) of the process $G(t)$. Further, define $X = \mathrm{argmin}\, G$, $X_1 = \mathrm{argmin}\, G_1$ and $U_1 = \inf G_1$. Let
\[ J = \begin{cases} X_1 U_1^2 + \displaystyle\int_{X_1}^{X} Z^2(t)\,dt, & \text{if } X_1 \le X \\[6pt] X_1 U_1^2 - \displaystyle\int_{X}^{X_1} Z^2(t)\,dt, & \text{if } X < X_1. \end{cases} \]
We state the main result as follows. The proof is presented in Section 3.
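To make the characterizations above concrete, the following is a minimal Python sketch (not part of the paper; all function names are our own). It implements PAVA, the penalized estimator of Proposition 1 (PAVA applied to $(Y_1 + n\alpha, Y_2, \ldots, Y_n)$), the constrained fit of Lemma 1 via the Lagrange multiplier $\hat\beta = \max_s (cs/n - F_s)$, and the resulting statistic $Q$.

```python
def pava(y):
    # Pool-Adjacent-Violators: increasing least-squares fit to y.
    blocks = []  # each block holds [sum, count]
    for v in y:
        blocks.append([v, 1])
        # merge while the previous block's mean exceeds the current one's
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit += [s / c] * c
    return fit

def penalized_fit(y, alpha):
    # Proposition 1: the minimizer of sum (Y_i - mu_i)^2 - 2 n alpha mu_1
    # is the PAVA fit to the modified data (Y_1 + n*alpha, Y_2, ..., Y_n).
    n = len(y)
    return pava([y[0] + n * alpha] + list(y[1:]))

def constrained_fit(y, c):
    # Lemma 1: the increasing LS fit subject to mu_1 = c is the PAVA fit
    # to (Y_1 + n*beta, Y_2, ...), with beta = max_s (c*s/n - F_s).
    n = len(y)
    beta, csum = -float("inf"), 0.0
    for s, v in enumerate(y, start=1):
        csum += v
        beta = max(beta, (c * s - csum) / n)
    return pava([y[0] + n * beta] + list(y[1:]))

def penalized_rss_stat(y, alpha, c):
    # Q = [constrained penalized SS] - [unconstrained penalized SS].
    n = len(y)
    mu_c, mu_hat = constrained_fit(y, c), penalized_fit(y, alpha)
    rss_c = sum((v - m) ** 2 for v, m in zip(y, mu_c)) - 2 * n * alpha * c
    rss_u = sum((v - m) ** 2 for v, m in zip(y, mu_hat)) - 2 * n * alpha * mu_hat[0]
    return rss_c - rss_u
```

Since the constrained and unconstrained problems share the same objective, $Q \ge 0$ always, with $Q = 0$ when $c$ equals the unconstrained value $\hat\mu(0+)$.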
Theorem 1 Suppose $\mu'(0+) > 0$ and define $a = \mu'(0+)/2$ and $c = \mu(0+)$. Under the penalization $\alpha = (\sigma^4/n^2 a)^{1/3}$, we have
\[ n^{1/3}(\hat\mu(0+) - c) \Rightarrow (a\sigma^2)^{1/3} U_1, \qquad Q \Rightarrow \sigma^2 J, \]
where $\Rightarrow$ denotes convergence in distribution.

In the following, we state some preparatory results, with their proofs deferred until Section 3. They will be essential in proving Theorem 1. To investigate the large-sample properties of $Q$, we need to derive the limits of $\hat\mu$ and $\hat\mu^c$ respectively. Equivalently, we look at the joint limit distribution of $(b_{m_0}, \hat\mu(0+), b_{s_0}, \tilde\mu)$. Our next proposition provides the result.

Proposition 2 Suppose $a$ and $\alpha$ are defined as in Theorem 1. We define a scaled version of the LSE, viz. $Z_n(t) = (a\sigma^2)^{-1/3}\, n^{1/3} \big(\tilde\mu((\sigma^2/na^2)^{1/3}\, t) - c\big)$. Then, as $n \to \infty$,
\[ n^{1/3}(\hat\mu(0+) - c) \Rightarrow (a\sigma^2)^{1/3} U_1, \]
\[ n^{1/3}\, b_{m_0} \Rightarrow (\sigma^2/a^2)^{1/3} X_1, \qquad n^{1/3}\, b_{s_0} \Rightarrow (\sigma^2/a^2)^{1/3} X, \]
\[ \{Z_n(t)\}_{t \ge 0} \Rightarrow \{Z(t)\}_{t \ge 0}. \]
The proof is deferred until the next section. The last step before proving Theorem 1 is to express $Q$ in terms of $(b_{m_0}, \hat\mu(0+), b_{s_0}, \tilde\mu)$, or equivalently $(m_0, \hat\mu(0+), s_0, \tilde\mu)$. The proof of the following lemma requires algebraic manipulation and several properties of the estimator $\tilde\mu$.

Lemma 2 $Q$ can be expressed as follows:
\[ Q = \begin{cases} m_0 (\hat\mu(0+) - c)^2 + \displaystyle\sum_{i=m_0+1}^{s_0} (\tilde\mu_i - c)^2, & \text{if } m_0 < s_0 \\[6pt] m_0 (\hat\mu(0+) - c)^2 - \displaystyle\sum_{i=s_0+1}^{m_0} (\tilde\mu_i - c)^2, & \text{if } s_0 \le m_0. \end{cases} \]
Using Proposition 2 and Lemma 2, we can deduce Theorem 1 using the continuous mapping theorem. The details are sketched in the proof in the next section.
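The quantiles of $U_1$ needed to turn Theorem 1 into confidence intervals are not available in closed form, but they can be approximated by simulating the limit processes on a grid. The sketch below is our own illustration, not the paper's method; the horizon `T`, the grid size, and the function names are arbitrary choices. It draws $(U_1, X_1, X)$ from a random-walk approximation to the Brownian motion $B$.

```python
import math
import random

def limit_draw(rng, T=8.0, steps=4000):
    # One approximate draw of (U1, X1, X): U1 = inf G1, X1 = argmin G1,
    # X = argmin G, where G(t) = B(t) + t^2 and G1(t) = (G(t) + 1)/t.
    dt = T / steps
    t = [(i + 1) * dt for i in range(steps)]  # start at dt: G1 blows up at t = 0
    b, G = 0.0, []
    for ti in t:
        b += rng.gauss(0.0, math.sqrt(dt))    # Brownian increment
        G.append(b + ti * ti)                 # G(t) = B(t) + t^2
    G1 = [(g + 1.0) / ti for g, ti in zip(G, t)]
    U1 = min(G1)
    X1 = t[G1.index(U1)]
    X = t[G.index(min(G))]
    return U1, X1, X

# Monte Carlo estimate of an upper quantile of U1, e.g.:
# draws = sorted(limit_draw(random.Random(i))[0] for i in range(2000))
# q95 = draws[int(0.95 * len(draws))]
```

Simulating $J$ as well would additionally require the right derivative $Z(t)$ of the GCM of the simulated path, which can be computed with a convex-hull routine.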
2.4 Discussion

Theorem 1 shows consistency of the penalized LSE and gives the limit distribution for both the LSE and the RSS. This is analogous to the asymptotic normality of the usual LSE and the $\chi^2$ convergence of the corresponding RSS in regular parametric models. Moreover, it can be compared to the corresponding results for monotone regression at an interior point. There, the LSE converges to a non-normal distribution at rate $n^{1/3}$, and the RSS or the likelihood ratio converges to a universal distribution different from $\chi^2$; see, e.g., Banerjee and Wellner (2001). At the end-point, however, we get a completely different distribution, and the effect of $\mu'(0+)$ cannot be neglected.

It should be noted that we need consistent estimators of $\sigma^2$ and $\mu'(0+)$ to construct a confidence interval. Both are relatively tricky. Meyer and Woodroofe (2000) discuss a method that yields a consistent estimator of $\sigma^2$ using the LSE $\tilde\mu$. The estimation of $\mu'(0+)$ can be done using techniques similar to those in density estimation, discussed by Pal (2006).

3 Proofs

3.1 Proof of Proposition 1

We observe that
\[ \sum_{i=1}^n (Y_i - \mu_i)^2 - 2n\alpha\mu_1 = \sum_{i=1}^n (\tilde Y_i - \mu_i)^2 - 2n\alpha Y_1 - n^2\alpha^2, \]
where $\tilde Y_1 = Y_1 + n\alpha$ and $\tilde Y_i = Y_i$ for $i \ge 2$. Therefore the minimizer $\hat\mu$ is the PAVA estimator for the new data $\tilde Y_1, Y_2, \ldots, Y_n$, i.e. $\hat\mu_k = \min_{s \ge k} \max_{r \le k} (\tilde Y_r + \cdots + \tilde Y_s)/(s - r + 1)$. Therefore,
\[ \hat\mu(0+) = \min_{r \ge 1} \frac{\tilde Y_1 + \cdots + \tilde Y_r}{r} = \min_{r \ge 1} \frac{Y_1 + \cdots + Y_r + n\alpha}{r} = \min_{r \ge 1} \frac{n(F_r + \alpha)}{r} = \min_{0 < b \le 1} \frac{F(b) + \alpha}{b}. \]
Moreover, except on the first block (i.e. up to the point where $\hat\mu_i = \hat\mu_1$), the contribution of $\tilde Y_1$ does not influence the estimator $\hat\mu$, and hence $\hat\mu_k = \tilde\mu_k$. Let $m_0$ denote the first break-off point of the
estimator $\hat\mu$. It is characterized as
\[ m_0 = \mathrm{argmin}_{r \ge 1} \frac{Y_1 + \cdots + Y_r + n\alpha}{r} = \mathrm{argmin}_r \frac{n(F_r + \alpha)}{r} = n\,\mathrm{argmin}_{0 < b \le 1} \frac{F(b) + \alpha}{b}. \]
The proposition follows.

3.2 Proof of Lemma 1

As in Proposition 1, $\hat\mu^c$ is the PAVA estimator for the new data set $Y_1 + n\hat\beta, Y_2, \ldots, Y_n$, where $\hat\beta$ solves the equation
\[ \hat\mu^c_1 = \min_{s \ge 1} \frac{Y_1 + \cdots + Y_s + n\beta}{s} = c. \]
Therefore $\hat\beta = \max_{s \ge 1} (cs/n - F_s) = \max_{0 \le b \le 1} (cb - F(b))$. To see this, observe that
\[ \min_{s \ge 1} \frac{Y_1 + \cdots + Y_s + n\hat\beta}{s} = c \]
\[ \iff \frac{Y_1 + \cdots + Y_s + n\hat\beta}{s} \ge c \ \text{for all } s, \text{ with equality at least once}, \]
\[ \iff nF_s + n\hat\beta \ge cs \ \text{for all } s, \text{ with equality at least once}, \]
\[ \iff \hat\beta \ge \frac{cs}{n} - F_s \ \text{for all } s, \text{ with equality at least once}, \]
and the definition of $\hat\beta$ follows. Moreover, $\hat\mu^c$ equals $c$ until its first break-off point $s_0$, and equals $\tilde\mu$ beyond that. Clearly, $s_0 = \mathrm{argmin}_s (F_s - cs/n) = n\,\mathrm{argmin}_{0 \le b \le 1} (F(b) - cb)$.

3.3 Proof of Proposition 2

Define the step function $\Lambda_n(b) = \lfloor nb \rfloor / n$. To prove the result, we invoke the Hungarian embedding for the limit of partial sums of i.i.d. random variables; see, e.g., Karatzas and Shreve (1991) for a discussion. Define the partial sums $S_r = \epsilon_1 + \cdots + \epsilon_r$ for $r = 1, \ldots, n$. Assume that the $\epsilon_i$ have a finite moment generating function and common variance $\sigma^2$. Then we can ascertain the existence of a standard
Brownian motion $B_n$ for each $n$, and a sequence of random variables $T_n$ having the same distribution as $S_n$, such that
\[ \sup_t |T_{nt} - \sigma\sqrt{n}\, B_n(t)| = O(\log n) \]
almost surely. This result is also known as the strong approximation. Since we are looking at convergence in distribution, we treat $T_n$ and $S_n$ alike, and replace one by the other. We observe that
\begin{align*}
F(b) &= \frac{1}{n} \sum_{i=1}^{\lfloor nb \rfloor} Y_i + O_p(1/n) = \frac{1}{n} \sum_{i=1}^{\lfloor nb \rfloor} \big[\mu(x_i) + \epsilon_i\big] + O_p(1/n) \\
&= \int_0^b \mu(t)\, d\Lambda_n(t) + \frac{1}{n} S_{n\Lambda_n(b)} + O_p(1/n) \\
&= \int_0^b \mu(t)\, dt + \frac{\sigma}{\sqrt n} B_n(b) + O_p(1/n) + O\Big(\frac{\log n}{n}\Big) \\
&= \int_0^b \Big(\mu(0) + t\mu'(0) + \frac{t^2}{2}\mu''(t^*)\Big)\, dt + \frac{\sigma}{\sqrt n} B_n(b) + O_p(1/n) + O\Big(\frac{\log n}{n}\Big) \\
&= cb + ab^2 + o(b^2) + \frac{\sigma}{\sqrt n} B_n(b) + O_p(1/n) + O\Big(\frac{\log n}{n}\Big) \\
&= cb + ab^2 + \frac{\sigma}{\sqrt n} B_n(b) + R_n(b),
\end{align*}
where the remainder $R_n(b)$ is negligible for our calculations. Define $\Delta = (\sigma^2/na^2)^{1/3}$ and $B_n^*(t) = B_n(\Delta t)/\sqrt\Delta$. Then
\begin{align*}
b_{m_0} &= \mathrm{argmin}_{0 < b \le 1} \frac{F(b) + \alpha}{b} = \mathrm{argmin}_{0 < b \le 1} \frac{cb + ab^2 + \frac{\sigma}{\sqrt n} B_n(b) + \alpha + R_n(b)}{b} \\
&= \Delta\,\mathrm{argmin}_{0 < t \le 1/\Delta} \frac{c\Delta t + a\Delta^2 t^2 + \frac{\sigma}{\sqrt n} B_n(\Delta t) + \alpha}{\Delta t} + o_p(\Delta), \quad \text{taking } b = \Delta t, \\
&= \Delta\,\mathrm{argmin}_{0 < t \le 1/\Delta} \frac{a\Delta^2 t^2 + \sigma\sqrt{\Delta/n}\, B_n^*(t) + \alpha}{\Delta t} + o_p(n^{-1/3}).
\end{align*}
Using $\alpha = (\sigma^4/n^2 a)^{1/3}$, so that $a\Delta^2 = \sigma\sqrt{\Delta/n} = \alpha$, we conclude
\[ b_{m_0} = \Big(\frac{\sigma^2}{na^2}\Big)^{1/3} \mathrm{argmin}_{0 < t \le (na^2/\sigma^2)^{1/3}} \Big[\frac{t^2 + B_n^*(t) + 1}{t}\Big] + o_p(n^{-1/3}), \]
and therefore $n^{1/3}\, b_{m_0} \Rightarrow (\sigma^2/a^2)^{1/3} X_1$. Moreover,
\begin{align*}
\hat\mu(0+) - c &= \min_{0 < b \le 1} \frac{F(b) + \alpha}{b} - c = \min_{0 < b \le 1} \frac{ab^2 + \frac{\sigma}{\sqrt n} B_n(b) + \alpha + R_n(b)}{b} \\
&= a\Delta \min_{0 < t \le 1/\Delta} \frac{t^2 + B_n^*(t) + 1}{t} + o_p(n^{-1/3}),
\end{align*}
so that $n^{1/3}(\hat\mu(0+) - c) \Rightarrow (a\sigma^2)^{1/3} U_1$.

Consider the constrained estimator now. First,
\begin{align*}
b_{s_0} &= \mathrm{argmin}_b\, (F(b) - cb) = \mathrm{argmin}_b \Big[ab^2 + \frac{\sigma}{\sqrt n} B_n(b) + R_n(b)\Big] \\
&= \Big(\frac{\sigma^2}{na^2}\Big)^{1/3} \mathrm{argmin}_t \big(t^2 + B_n^*(t)\big) + o_p(n^{-1/3}),
\end{align*}
so that $n^{1/3}\, b_{s_0} \Rightarrow (\sigma^2/a^2)^{1/3} X$.

We now look at the behavior of $\tilde\mu$, properly scaled, in the region $(b_{s_0}, b_{m_0})$. Recall that it is the right derivative of the Greatest Convex Minorant of the cusum diagram $F(b)$ at the point $b = s$. Approximating $F(b)$ by $ab^2 + cb + (\sigma/\sqrt n) B_n(b)$ in this interval, we write
\[ \tilde\mu(s) = \frac{d}{db}\Big(\mathrm{GCM}\big(ab^2 + cb + \tfrac{\sigma}{\sqrt n} B_n(b)\big)\Big)\Big|_{b=s}. \]
Using the scaling $b = \Delta t$ as before, and the fact that the GCM of a process plus a linear term equals that linear term plus the GCM of the process itself, we get
\begin{align*}
\tilde\mu(\Delta t) - c &= \frac{1}{\Delta}\, \frac{d}{dt}\Big(\mathrm{GCM}\big(a\Delta^2 t^2 + \sigma\sqrt{\tfrac{\Delta}{n}}\, B_n^*(t)\big)\Big) \\
&= a\Delta\, \frac{d}{dt}\Big(\mathrm{GCM}\big(t^2 + B_n^*(t)\big)\Big), \qquad \text{since } a\Delta^2 = \sigma\sqrt{\Delta/n}, \\
&= \Big(\frac{a\sigma^2}{n}\Big)^{1/3} \frac{d}{dt}\Big(\mathrm{GCM}\big(t^2 + B_n^*(t)\big)\Big),
\end{align*}
which yields
\[ n^{1/3}\big(\tilde\mu\big(t(\sigma^2/na^2)^{1/3}\big) - c\big) \Rightarrow (a\sigma^2)^{1/3} Z(t). \]
The remainders in all these calculations are negligible, as we look at points that are at distance $O_p(n^{-1/3})$ from zero. Moreover, since the individual limit distributions are derived from the same Brownian motion $B_n$, the joint convergence follows.

3.4 Proof of Lemma 2

The following identities are useful in proving the lemma.
\[ \sum_{i=1}^{m_0} Y_i = \sum_{i=1}^{m_0} \hat\mu_i = m_0\hat\mu_1 - n\alpha \tag{A} \]
\[ \sum_{i=1}^{s_0} Y_i = \sum_{i=1}^{s_0} \hat\mu^c_i = cs_0 - n\hat\beta \tag{B} \]
If $m_0 < s_0$,
\[ \sum_{i=m_0+1}^{s_0} Y_i = \sum_{i=m_0+1}^{s_0} \tilde\mu_i \tag{C1} \]
\[ \sum_{i=m_0+1}^{s_0} \tilde\mu_i Y_i = \sum_{i=m_0+1}^{s_0} \tilde\mu_i^2 \tag{C2} \]
The first part of (A) follows from the properties of the PAVA estimator, which is constant over each block and equals the average of the $Y$-values in that block. The second part follows similarly, since $m_0$ is the first break-off point for $\tilde Y_1, Y_2$, etc., and therefore $\hat\mu_1 = (\tilde Y_1 + \cdots + Y_{m_0})/m_0 = (\sum_{i=1}^{m_0} Y_i + n\alpha)/m_0$. (B) and (C1) follow similarly. (C2) follows by splitting the summation over blocks first, and then taking the sum of the $Y$'s in each block. In case $s_0 \le m_0$, the equalities (C1)
and (C2) remain valid, with the roles of $m_0$ and $s_0$ interchanged. Now,
\begin{align*}
Q &= \Big[\min_{\mu,\, \mu_1 = c} \sum_{i=1}^n (Y_i - \mu_i)^2 - 2n\alpha c\Big] - \Big[\min_\mu \sum_{i=1}^n (Y_i - \mu_i)^2 - 2n\alpha\mu_1\Big] \\
&= \sum_{i=1}^n (Y_i - \hat\mu^c_i)^2 - 2n\alpha c - \sum_{i=1}^n (Y_i - \hat\mu_i)^2 + 2n\alpha\hat\mu_1 \\
&= \sum_{i=1}^{s_0} (Y_i - c)^2 + \sum_{i=s_0+1}^n (Y_i - \tilde\mu_i)^2 - \sum_{i=1}^{m_0} (Y_i - \hat\mu_1)^2 - \sum_{i=m_0+1}^n (Y_i - \tilde\mu_i)^2 + 2n\alpha(\hat\mu(0+) - c).
\end{align*}
We use the identities (A), (B), (C1) and (C2) in the following two cases.

Case 1: $m_0 < s_0$.
\begin{align*}
Q &= 2n\alpha(\hat\mu(0+) - c) + \sum_{i=1}^{m_0} \big[(Y_i - c)^2 - (Y_i - \hat\mu_1)^2\big] + \sum_{i=m_0+1}^{s_0} \big[(Y_i - c)^2 - (Y_i - \tilde\mu_i)^2\big] \\
&= 2n\alpha(\hat\mu(0+) - c) + 2(\hat\mu_1 - c)\sum_{i=1}^{m_0} Y_i + m_0\big[c^2 - \hat\mu_1^2\big] + \sum_{i=m_0+1}^{s_0} \big[c^2 - 2cY_i - \tilde\mu_i^2 + 2\tilde\mu_i Y_i\big] \\
&= 2m_0(\hat\mu_1 - c)\hat\mu_1 + m_0\big[c^2 - \hat\mu_1^2\big] + \sum_{i=m_0+1}^{s_0} \big[c^2 - 2c\tilde\mu_i - \tilde\mu_i^2 + 2\tilde\mu_i^2\big] \\
&= m_0(\hat\mu(0+) - c)^2 + \sum_{i=m_0+1}^{s_0} (\tilde\mu_i - c)^2.
\end{align*}

Case 2: $s_0 \le m_0$.
\begin{align*}
Q &= 2n\alpha(\hat\mu_1 - c) + \sum_{i=1}^{s_0} \big[(Y_i - c)^2 - (Y_i - \hat\mu_1)^2\big] + \sum_{i=s_0+1}^{m_0} \big[(Y_i - \tilde\mu_i)^2 - (Y_i - \hat\mu_1)^2\big] \\
&= 2n\alpha(\hat\mu_1 - c) + 2(\hat\mu_1 - c)\sum_{i=1}^{s_0} Y_i + s_0\big[c^2 - \hat\mu_1^2\big] + \sum_{i=s_0+1}^{m_0} \big[\tilde\mu_i^2 - \hat\mu_1^2 - 2\tilde\mu_i Y_i + 2\hat\mu_1 Y_i\big] \\
&= 2n\alpha(\hat\mu_1 - c) + 2(\hat\mu_1 - c)(cs_0 - n\hat\beta) + s_0\big[c^2 - \hat\mu_1^2\big] - (m_0 - s_0)\hat\mu_1^2 + \sum_{i=s_0+1}^{m_0} \big[\tilde\mu_i^2 - 2\tilde\mu_i^2\big] + 2\hat\mu_1 \sum_{i=s_0+1}^{m_0} \tilde\mu_i \\
&= 2(\hat\mu_1 - c)\Big(n\alpha + cs_0 - n\hat\beta + \sum_{i=s_0+1}^{m_0} \tilde\mu_i\Big) + s_0 c^2 - m_0\hat\mu_1^2 - \sum_{i=s_0+1}^{m_0} (\tilde\mu_i - c)^2 + c^2(m_0 - s_0) \\
&= m_0(\hat\mu(0+) - c)^2 - \sum_{i=s_0+1}^{m_0} (\tilde\mu_i - c)^2,
\end{align*}
since $\sum_{i=s_0+1}^{m_0} \tilde\mu_i = m_0\hat\mu_1 - n\alpha - cs_0 + n\hat\beta$ by (A), (B) and (C1).
The lemma follows.

3.5 Proof of Theorem 1

To start with, let $s_0 \le m_0$. Then
\[ \sum_{i=s_0+1}^{m_0} (\tilde\mu_i - c)^2 = n \int_{b_{s_0}}^{b_{m_0}} (\tilde\mu(s) - c)^2\, d\Lambda_n(s) = n \int_{b_{s_0}}^{b_{m_0}} (\tilde\mu(s) - c)^2\, ds + o_p(1), \]
since both $b_{m_0}$ and $b_{s_0}$ are of order $n^{-1/3}$. Similarly, for $m_0 < s_0$ we have $\sum_{i=m_0+1}^{s_0} (\tilde\mu_i - c)^2 = n \int_{b_{m_0}}^{b_{s_0}} (\tilde\mu(s) - c)^2\, ds + o_p(1)$.

From Proposition 2, $(na^2/\sigma^2)^{1/3} (b_{m_0}, b_{s_0})$ converges weakly to $(X_1, X)$, where the limit processes are functions of the same standard Brownian motion $B(t)$. Therefore, substituting $s = \Delta t$,
\begin{align*}
n \int_{b_{s_0}}^{b_{m_0}} (\tilde\mu(s) - c)^2\, ds &= n\Delta \int_{b_{s_0}/\Delta}^{b_{m_0}/\Delta} (\tilde\mu(\Delta t) - c)^2\, dt \\
&= n^{1/3}\Delta \int_{b_{s_0}/\Delta}^{b_{m_0}/\Delta} \Big[n^{1/3}\big(\tilde\mu\big(t(\sigma^2/na^2)^{1/3}\big) - c\big)\Big]^2\, dt \\
&\Rightarrow \sigma^2 \int_X^{X_1} Z^2(t)\, dt.
\end{align*}
The same result holds on the set $b_{m_0} \le b_{s_0}$, with the roles interchanged. Finally, we recall that
\[ m_0(\hat\mu(0+) - c)^2 = n^{1/3}\, b_{m_0} \Big[n^{1/3}(\hat\mu(0+) - c)\Big]^2 \Rightarrow \Big(\frac{\sigma^2}{a^2}\Big)^{1/3} X_1\, (a\sigma^2)^{2/3}\, U_1^2 = \sigma^2 X_1 U_1^2. \]
Therefore, using Lemma 2,
\[ Q \Rightarrow \sigma^2 \Big(X_1 U_1^2 + \int_{X_1}^X Z^2(t)\, dt\Big) = \sigma^2 J, \]
where the integral changes sign when $X_1 > X$.
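The proofs above repeatedly use the fact that the isotonic LSE $\tilde\mu$ is the (right) derivative of the Greatest Convex Minorant of the cusum diagram. This equivalence can be checked numerically: the sketch below (our own illustration, not from the paper; helper names are arbitrary) computes the GCM of the cusum points via a lower convex hull and compares its segment slopes with the PAVA fit.

```python
def pava(y):
    # Pool-Adjacent-Violators: increasing least-squares fit to y.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out += [s / c] * c
    return out

def _cross(o, a, b):
    # > 0 iff o -> a -> b is a counter-clockwise (left) turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def gcm_slopes(y):
    # Slopes of the GCM of the cusum diagram {(0,0)} U {(r/n, F_r)},
    # assigned to the design points x_i = i/n.
    n = len(y)
    pts = [(0.0, 0.0)]
    s = 0.0
    for r, v in enumerate(y, start=1):
        s += v
        pts.append((r / n, s / n))
    # lower convex hull (Andrew's monotone chain; x-coordinates already sorted)
    hull = []
    for p in pts:
        while len(hull) > 1 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    fit = []
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        slope = (y1 - y0) / (x1 - x0)
        fit += [slope] * round((x1 - x0) * n)
    return fit
```

On any data set, `gcm_slopes(y)` and `pava(y)` should agree up to floating-point error, which is exactly the GCM characterization used in the proof of Proposition 2.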
Acknowledgments

The research was supported by NSF-SAMSI cooperative agreement No. DMS-0112069, during postdoctoral research at SAMSI. Discussions with Professor Moulinath Banerjee have been instrumental for the inception of the problem. The author also acknowledges the contribution of Professor Michael Woodroofe as the doctoral thesis supervisor.

References

[1] Banerjee, M. and Wellner, J. A. (2001), Likelihood ratio tests for monotone functions, Ann. Statist. 29, 1699-1731.

[2] Groeneboom, P. and Wellner, J. A. (2001), Computing Chernoff's distribution, Journal of Computational & Graphical Statistics 10, 388-400.

[3] Karatzas, I. and Shreve, S. E. (1991), Brownian Motion and Stochastic Calculus (Springer-Verlag, New York).

[4] Meyer, M. and Woodroofe, M. B. (2000), On the degrees of freedom in shape-restricted regression, Ann. Statist. 28, 1083-1104.

[5] Pal, J. K. (2006), Statistical analysis and inference in shape-restricted problems with applications to astronomy (Doctoral Dissertation, University of Michigan).

[6] Robertson, T., Wright, F. T. and Dykstra, R. L. (1987), Order Restricted Statistical Inference (John Wiley, New York).

[7] Woodroofe, M. B. and Sun, J. (1993), A penalized maximum likelihood estimate of f(0+) when f is non-increasing, Statist. Sinica 3, 501-515.

[8] Wright, F. T. (1981), The asymptotic behavior of monotone regression estimates, Ann. Statist. 9, 443-448.