Nonparametric Modal Regression


Summary

In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. Nonparametric modal regression is distinguished from conventional nonparametric regression in that it measures the center by the most likely conditional value rather than by the conditional average. We propose to estimate the nonparametric modal regression by local polynomial regression, and we investigate the asymptotic properties of the resulting estimator. We further develop a modal expectation-maximization (EM) algorithm for the proposed model. A simulation study confirms the theoretical findings, and the proposed methodology is further illustrated through the analysis of a real data example.

Some key words: Linear regression; M-estimator; Modal regression; Mode; Robust regression.

1. Introduction

Given a random sample {(x_i, y_i), i = 1, ..., n}, where x_i is a p-dimensional column vector, suppose f(y|x) is the conditional density function of Y given X = x. Conventional nonparametric regression models use the mean of f(y|x) to investigate the relationship between Y and X. When the distribution is highly skewed, it is well known that the mode provides a more meaningful location measure than the mean. Several authors have made efforts to identify the modes of population distributions. See, for example, Silverman (1981); Muller and Sawitzki (1991); Scott (1992); Friedman and Fisher (1999); Chaudhuri and Marron (1999); Fisher and Marron (2001); Davies and Kovac (2004); Hall, Minnotte, and Zhang (2004); Ray and Lindsay (2005); Yao and Lindsay (2009).

In this article, we propose a nonparametric modal regression model that aims to estimate the mode of f(y|x) for any given x. Modal regression measures the center by the most likely conditional value rather than by the conditional average used by traditional regression methods. We propose to estimate the nonparametric modal regression by local polynomial regression and systematically study the sampling properties of the proposed estimation procedure. We further develop a modal expectation-maximization (EM) algorithm to compute the proposed estimator. An extensive Monte Carlo simulation study is conducted to assess the finite-sample performance of the proposed procedure and to compare the proposed estimator with its least squares counterpart when the error density is symmetric, and a real data application is used to further illustrate the methodology.

The rest of this article is organized as follows. In Section 2, we introduce the new nonparametric modal regression model and the estimation procedure based on local polynomial regression; the asymptotic properties of the resulting estimator are also provided. In Section 3, a Monte Carlo simulation study is conducted and a real data application is presented. Some discussions are given in Section 4. Technical conditions and proofs are given in the Appendix.

2. Nonparametric Modal Regression

In this section, we introduce the nonparametric modal regression model and propose to estimate it by local polynomial regression. An EM-type algorithm is proposed to compute the unknown modal parameters, and the asymptotic properties of the proposed estimator are studied.

2.1. Introduction to the model and estimation procedure

Suppose that (x_1, y_1), ..., (x_n, y_n) are an independent and identically distributed random sample from f(x, y). The modal regression is defined as

m(x) = \arg\max_y f(y \mid x),   (2.1)

where m(\cdot) is an unknown nonparametric smooth function to be estimated. Let \epsilon = y - m(x), and denote by g(t|x) the conditional density of \epsilon given X = x. Based on the model assumption (2.1), g(t|x) is maximized at t = 0 for any x. If g(t|x) is symmetric about 0, then m(x) is the same as the conventional regression function E(Y | X = x).

In this article, we propose an estimation procedure for the nonparametric modal regression m(x). Since f(y|x) = f(x, y)/f(x), finding the mode of f(y|x) is equivalent to finding the mode of f(x, y) with x fixed. Suppose f(x, y) is estimated by the kernel density estimator

\hat f(x, y) = n^{-1} \sum_{i=1}^n K_{h_1}(x_i - x) \phi_{h_2}(y_i - y),

where K_{h_1}(x) = h_1^{-1} K(x/h_1) and \phi_{h_2}(t) = h_2^{-1} \phi(t/h_2) are symmetric kernel functions and (h_1, h_2) are the bandwidths. A natural estimation procedure is to estimate m(x_0) by

\hat m(x_0) = \arg\max_y n^{-1} \sum_{i=1}^n K_{h_1}(x_i - x_0) \phi_{h_2}(y_i - y).   (2.2)

Note that the modal regression (2.2) uses only one term (the intercept) for the conditional mode, like the local constant estimator (Nadaraya, 1964; Watson, 1964).
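To fix ideas, the estimator (2.2) can be computed directly by maximizing the kernel density estimate over a grid of candidate y values. The following is a minimal NumPy sketch under our own conventions (Gaussian kernels for both K and \phi, and a grid whose range and resolution are arbitrary illustrative choices); it is an illustration only, not the estimation algorithm proposed in this article, which is the MEM algorithm of Section 2.2.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density; used here for both K and phi."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def local_constant_mode(x, y, x0, h1, h2, grid_size=400):
    """Local constant modal estimate (2.2):
    m_hat(x0) = argmax_y n^{-1} sum_i K_h1(x_i - x0) phi_h2(y_i - y),
    approximated by a grid search over candidate y values."""
    kx = gaussian_kernel((x - x0) / h1) / h1            # K_h1(x_i - x0)
    y_grid = np.linspace(y.min(), y.max(), grid_size)   # candidate modes
    obj = np.array([np.mean(kx * gaussian_kernel((y - yg) / h2) / h2)
                    for yg in y_grid])
    return y_grid[np.argmax(obj)]
```

Evaluating local_constant_mode along a grid of x0 values traces out the estimated modal curve \hat m(\cdot).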

It is known that the local linear estimator is superior to the local constant one (Fan and Gijbels, 1996), so we may want to extend the idea of the local constant modal regression (2.2) to the local linear case, or more generally to the local polynomial case. For x in a neighborhood of x_0, we approximate

m(x) \approx \sum_{j=0}^p \frac{m^{(j)}(x_0)}{j!} (x - x_0)^j \equiv \sum_{j=0}^p \beta_j (x - x_0)^j,

where \beta_j = m^{(j)}(x_0)/j!. Our local polynomial modal regression (LPMR) estimation procedure is to maximize over \theta = (\beta_0, ..., \beta_p)^T

l(\theta) = n^{-1} \sum_{i=1}^n K_{h_1}(x_i - x_0) \, \phi_{h_2}\Big( y_i - \sum_{j=0}^p \beta_j (x_i - x_0)^j \Big).   (2.3)

For ease of computation, we use the standard normal density for \phi(t) throughout this article; see (2.4) below. Denote the maximizer of l(\theta) by \hat\theta = (\hat\beta_0, ..., \hat\beta_p)^T. Then the estimator of m^{(v)}(x), the v-th derivative of m(x), is \hat m^{(v)}(x_0) = v! \hat\beta_v, for v = 0, 1, ..., p. Specifically, when p = 1 and v = 0, we refer to this method as local linear modal regression (LLMR).
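Before turning to the algorithm, note that the criterion (2.3) itself is straightforward to evaluate, which is useful for comparing candidate solutions from different starting values. Below is a small sketch (the function name and the use of numpy.polyval are our own choices) that computes l(\theta) with the Gaussian kernel K and the standard normal \phi assumed throughout.

```python
import numpy as np

def lpmr_objective(theta, x, y, x0, h1, h2):
    """Evaluate the LPMR criterion l(theta) in (2.3).
    theta = (beta_0, ..., beta_p): local polynomial coefficients at x0."""
    dx = x - x0
    fit = np.polyval(np.asarray(theta)[::-1], dx)   # sum_j beta_j (x_i - x0)^j
    kx = np.exp(-0.5 * (dx / h1) ** 2) / (np.sqrt(2 * np.pi) * h1)
    phi = np.exp(-0.5 * ((y - fit) / h2) ** 2) / (np.sqrt(2 * np.pi) * h2)
    return np.mean(kx * phi)
```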

2.2. Modal expectation-maximization algorithm

Note that (2.3) does not have an explicit solution. In this section, we extend the modal expectation-maximization (MEM) algorithm, proposed by Li et al. (2007), to maximize (2.3). Similar to an EM algorithm, the MEM algorithm consists of two steps: an E-step and an M-step. Let \theta^{(0)} be the initial value. Starting with k = 0:

E-Step: Update \pi(j \mid \theta^{(k)}):

\pi(j \mid \theta^{(k)}) = \frac{ K_{h_1}(x_j - x_0) \, \phi_{h_2}\{ y_j - \sum_{l=0}^p \beta_l^{(k)} (x_j - x_0)^l \} }{ \sum_{i=1}^n K_{h_1}(x_i - x_0) \, \phi_{h_2}\{ y_i - \sum_{l=0}^p \beta_l^{(k)} (x_i - x_0)^l \} }, \quad j = 1, ..., n.

M-Step: Update \theta^{(k+1)}:

\theta^{(k+1)} = \arg\max_\theta \sum_{j=1}^n \pi(j \mid \theta^{(k)}) \log \phi_{h_2}\Big\{ y_j - \sum_{l=0}^p \beta_l (x_j - x_0)^l \Big\} = (X^T W_k X)^{-1} X^T W_k Y,   (2.4)

since \phi(\cdot) is the density function of a standard normal distribution. Here X = (\tilde x_1, ..., \tilde x_n)^T with \tilde x_i = \{1, x_i - x_0, ..., (x_i - x_0)^p\}^T, W_k is an n \times n diagonal matrix with diagonal elements \pi(j \mid \theta^{(k)}), and Y = (y_1, ..., y_n)^T.

The MEM algorithm iterates the E-step and the M-step until convergence. The ascending property of the proposed MEM algorithm can be established along the lines of Li et al. (2007). Note that, as with the usual EM algorithm, the converged value may depend on the starting point, and there is no guarantee that the algorithm converges to the global optimum. Therefore, it is prudent to run the algorithm from several starting points and choose the best local optimum found. The closed-form solution for \theta^{(k+1)} is one of the benefits of using the normal density \phi_{h_2}(\cdot) in (2.3).

From the MEM algorithm, it can be seen that the major difference between the conventional local polynomial regression (LPR) and the LPMR lies in the E-step. The contribution of an observation (x_i, y_i) to the LPR depends on the weight K_{h_1}(x_i - x_0), which in turn depends only on how close x_i is to x_0. In contrast, the weight in the LPMR depends both on how close x_i is to x_0 and on how close y_i is to the regression curve. This weighting scheme allows the LPMR to downweight observations far away from the regression curve and thus achieve adaptive robustness.
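The two steps assemble into a short fixed-point iteration. The sketch below is our own minimal implementation, with illustrative names, a least squares starting value, and a simple relative-change stopping rule that the text does not specify; it alternates the E-step weights with the weighted least squares M-step (2.4).

```python
import numpy as np

def mem_lpmr(x, y, x0, h1, h2, p=1, theta0=None, max_iter=200, tol=1e-8):
    """Modal EM for local polynomial modal regression at x0.
    Returns theta = (beta_0, ..., beta_p); m_hat(x0) = theta[0]."""
    dx = x - x0
    X = np.vander(dx, N=p + 1, increasing=True)      # rows {1, dx, ..., dx^p}
    kx = np.exp(-0.5 * (dx / h1) ** 2) / (np.sqrt(2 * np.pi) * h1)
    theta = (np.linalg.lstsq(X, y, rcond=None)[0]    # default start: LS fit
             if theta0 is None else np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        resid = y - X @ theta
        # E-step: pi_j proportional to K_h1(x_j - x0) * phi_h2(residual_j);
        # constants cancel in the normalization below
        w = kx * np.exp(-0.5 * (resid / h2) ** 2)
        w = w / w.sum()
        # M-step: weighted least squares, cf. (2.4)
        XtW = X.T * w
        theta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(theta_new - theta)) < tol * (1 + np.max(np.abs(theta))):
            theta = theta_new
            break
        theta = theta_new
    return theta
```

Since the converged value may depend on \theta^{(0)}, in practice one would call mem_lpmr from several starting values and retain the fit with the largest l(\hat\theta), e.g. as computed by lpmr_objective above.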

2.3. Theoretical properties

We first establish the convergence rate of the LPMR estimator in the following theorem, whose proof can be found in the Appendix.

Theorem 2.1. Under the regularity conditions (A1)-(A5) in the Appendix, there exists a consistent local maximizer \hat\theta of (2.3) such that

h_1^v \{ \hat m_v(x_0) - m^{(v)}(x_0) \} = O_p\{ (n h_1 h_2^3)^{-1/2} + h_1^{p+1} + h_2^2 \}, \quad v = 0, 1, ..., p,

where \hat m_v(x_0) = v! \hat\beta_v is the estimate of m^{(v)}(x_0) and m^{(v)}(x_0) is the v-th derivative of m(x) at x_0.

To derive the asymptotic bias and variance of the LPMR estimator, we need the following notation. The moments of K and K^2 are denoted by \mu_j = \int t^j K(t) \, dt and \nu_j = \int t^j K^2(t) \, dt, respectively. Let S, \tilde S, and S^* be the (p+1) \times (p+1) matrices with (j, l)-elements \mu_{j+l-2}, \mu_{j+l-1}, and \nu_{j+l-2}, respectively, and let c_p, \tilde c_p, and \bar c_p be the (p+1)-dimensional vectors with j-th elements \mu_{p+j}, \mu_{p+j+1}, and \mu_{j-1}, respectively. Furthermore, let e_{v+1} = (0, ..., 0, 1, 0, ..., 0)^T be the (p+1)-dimensional vector with 1 in the (v+1)-th position.

Theorem 2.2. Under the regularity conditions (A1)-(A5) in the Appendix, the asymptotic variance of \hat m_v(x_0), given in Theorem 2.1, is

Var\{ \hat m_v(x_0) \} = e_{v+1}^T S^{-1} S^* S^{-1} e_{v+1} \, \frac{ (v!)^2 g(0|x_0) \bar\nu }{ f(x_0) g''(0|x_0)^2 \, n h_1^{1+2v} h_2^3 } \{1 + o_p(1)\},   (2.5)

where \bar\nu = \int t^2 \phi^2(t) \, dt. The asymptotic bias of \hat m_v(x_0), denoted by b_v(x_0), for p - v odd is

b_v(x_0) = e_{v+1}^T S^{-1} \Big\{ h_1^{p+1-v} \frac{v!}{(p+1)!} m^{(p+1)}(x_0) c_p - \frac{ g'''(0|x_0) \, v! \, h_2^2 }{ 2 g''(0|x_0) h_1^v } \bar c_p \Big\} \{1 + o(1)\}.   (2.6)

Furthermore, the asymptotic bias for p - v even is

b_v(x_0) = e_{v+1}^T S^{-1} \Big[ h_1^{p+2-v} \frac{v!}{(p+2)!} \tilde c_p \Big\{ m^{(p+2)}(x_0) + (p+2) m^{(p+1)}(x_0) \frac{\Gamma'(x_0)}{\Gamma(x_0)} \Big\} - \frac{ g'''(0|x_0) \, v! \, h_2^2 }{ 2 g''(0|x_0) h_1^v } \bar c_p \Big] \{1 + o(1)\},   (2.7)

where \Gamma(x) = g''(0|x) f(x), provided that m^{(p+2)}(\cdot) is continuous in a neighborhood of x_0 and n h_1^3 h_2^5 \to \infty.

The proof of Theorem 2.2 is given in the Appendix. Similar to the LPR, the second term in (2.7) often creates extra bias; thus, it is preferable to use odd values of p - v in practice. This is consistent with the selection of the order p for the LPR (Fan and Gijbels, 1996).

Theorem 2.3. Under the regularity conditions (A1)-(A5) in the Appendix, the estimate \hat m_v(x_0), given in Theorem 2.1, has the following asymptotic distribution:

\frac{ \hat m_v(x_0) - m^{(v)}(x_0) - b_v(x_0) }{ \sqrt{ Var\{ \hat m_v(x_0) \} } } \xrightarrow{D} N(0, 1).

The proof of Theorem 2.3 is given in the Appendix. The optimal bandwidths are of order h_1 = O_p(n^{-2/(7p+9)}) and h_2 = O_p(n^{-(p+1)/(7p+9)}). Specifically, when p = 1 and v = 0, the asymptotic variance of \hat m(x_0) is

Var\{ \hat m(x_0) \} = \frac{ g(0|x_0) \bar\nu \nu_0 }{ n h_1 h_2^3 g''(0|x_0)^2 f(x_0) } \{1 + o_p(1)\}   (2.8)

and the asymptotic bias is

b(x_0) = \frac{1}{2} m''(x_0) \mu_2 h_1^2 - \frac{ g'''(0|x_0) h_2^2 }{ 2 g''(0|x_0) },   (2.9)

so that the optimal bandwidths are h_1 = O_p(n^{-1/8}) and h_2 = O_p(n^{-1/8}).
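All constants appearing in (2.5)-(2.9) are either kernel moments or population quantities, so the kernel part is easy to tabulate. The following sketch is our own illustration (the helper names are arbitrary, and the bandwidth function returns rates only, with the unknown constants set to one); it computes \mu_2, \nu_0, and \bar\nu for the Gaussian kernel and the rate-optimal bandwidth orders stated above.

```python
import numpy as np
from scipy.integrate import quad

# Gaussian kernel: K = phi, the standard normal density
phi = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

mu2 = quad(lambda t: t ** 2 * phi(t), -np.inf, np.inf)[0]        # = 1
nu0 = quad(lambda t: phi(t) ** 2, -np.inf, np.inf)[0]            # = 1/(2*sqrt(pi))
nubar = quad(lambda t: t ** 2 * phi(t) ** 2, -np.inf, np.inf)[0] # = 1/(4*sqrt(pi))

def rate_bandwidths(n, p):
    """Rate-optimal bandwidth orders from the discussion after Theorem 2.3
    (constants omitted): h1 ~ n^{-2/(7p+9)}, h2 ~ n^{-(p+1)/(7p+9)}."""
    return n ** (-2.0 / (7 * p + 9)), n ** (-(p + 1.0) / (7 * p + 9))

# e.g. for LLMR (p = 1) both bandwidths are of order n^{-1/8}
h1, h2 = rate_bandwidths(n=500, p=1)
```

For the Gaussian kernel, \mu_2 = 1, \nu_0 = 1/(2\sqrt\pi), and \bar\nu = 1/(4\sqrt\pi), which can be plugged into (2.8) and (2.9) once g and f are estimated or posited.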

2.4. Bandwidth selection

3. Simulation Study and Application

In this section, we conduct a Monte Carlo simulation study to assess the performance of the proposed LPMR. We also demonstrate the LPMR with a real data application.

3.1. Simulation study

3.2. An application

APPENDIX: PROOFS

The following technical conditions are imposed in this section. They are not the weakest possible conditions, but they are imposed to facilitate the proofs.

Technical Conditions:

(A1) m(x) has a continuous (p+1)-th derivative at the point x_0.

(A2) f(x) is bounded and has a continuous first derivative at the point x_0, and f(x_0) > 0.

(A3) g'(0|x) = 0 and g''(0|x) < 0; g^{(v)}(t|x) is bounded in a neighborhood of x_0 and, as a function of x, has a continuous first derivative at the point x_0, for v = 0, 1, ..., 4.

(A4) K(\cdot) is a symmetric (about 0) probability density with compact support [-1, 1].

(A5) The bandwidths h_1 and h_2 tend to 0 such that n h_1 h_2^5 \to \infty and h_1^{p+1}/h_2 \to 0.

Denote \tilde X_i = \{1, (X_i - x_0)/h_1, ..., (X_i - x_0)^p/h_1^p\}^T, H = \mathrm{diag}\{1, h_1, ..., h_1^p\}, \theta = (\beta_0, \beta_1, ..., \beta_p)^T, \theta^* = H\theta, R(X_i) = m(X_i) - \sum_{j=0}^p \beta_j (X_i - x_0)^j, and K_i = K_{h_1}(x_i - x_0), where \beta_j = m^{(j)}(x_0)/j!, j = 0, 1, ..., p.

Proof of Theorem 2.1. Denote \alpha_n = (n h_1 h_2^3)^{-1/2} + h_1^{p+1} + h_2^2. It is sufficient to show that for any given \eta > 0, there exists a large constant c such that

P\Big\{ \sup_{\|\mu\| = c} l(\theta^* + \alpha_n \mu) < l(\theta^*) \Big\} \ge 1 - \eta,   (A.1)

where l(\theta) is defined in (2.3). By a Taylor expansion, it follows that

l(\theta^* + \alpha_n \mu) - l(\theta^*)
= n^{-1} \sum_{i=1}^n K_i \big\{ \phi_{h_2}(\epsilon_i + R(X_i) - \alpha_n \mu^T \tilde X_i) - \phi_{h_2}(\epsilon_i + R(X_i)) \big\}
= n^{-1} \sum_{i=1}^n K_i \big\{ -\phi'_{h_2}(\epsilon_i + R(X_i)) \alpha_n \mu^T \tilde X_i + \tfrac{1}{2} \phi''_{h_2}(\epsilon_i + R(X_i)) \alpha_n^2 (\mu^T \tilde X_i)^2 - \tfrac{1}{6} \phi'''_{h_2}(z_i) \alpha_n^3 (\mu^T \tilde X_i)^3 \big\}
\equiv I_1 + I_2 + I_3,   (A.2)

where z_i lies between \epsilon_i + R(X_i) and \epsilon_i + R(X_i) - \alpha_n \mu^T \tilde X_i. Note that

\phi'_{h_2}(t) = -\frac{t}{h_2^3} \phi\Big(\frac{t}{h_2}\Big), \quad \phi''_{h_2}(t) = \frac{1}{h_2^3} \Big\{ \Big(\frac{t}{h_2}\Big)^2 - 1 \Big\} \phi\Big(\frac{t}{h_2}\Big), \quad \phi'''_{h_2}(t) = \frac{1}{h_2^4} \Big\{ \frac{3t}{h_2} - \Big(\frac{t}{h_2}\Big)^3 \Big\} \phi\Big(\frac{t}{h_2}\Big).

If a_n(x) = o_p(h_2) and g^{(v)}(t|x) is bounded in a neighborhood of x_0, we have

E\{ \phi'_{h_2}(\epsilon + a_n(x)) \mid X = x \} = -h_2^{-1} \int t \phi(t) g(t h_2 - a_n(x) \mid x) \, dt
= -h_2^{-1} \int t \phi(t) \{ g(t h_2 \mid x) - g'(t h_2 \mid x) a_n(x) + g''(z \mid x) a_n^2(x)/2 \} \, dt
= \{ g''(0|x) a_n(x) - g'''(0|x) h_2^2 / 2 \} \{ 1 + o_p(1) \}.

If h_1^{p+1}/h_2 \to 0, by directly calculating the mean and variance, we obtain

E(I_1) = -\alpha_n E\big[ K_i E\{ \phi'_{h_2}(\epsilon_i + R(X_i)) \mid X_i \} \mu^T \tilde X_i \big] \{1 + o(1)\}
= \alpha_n \mu^T \Big\{ \frac{g'''(0|x_0)}{2} f(x_0) \bar c_p h_2^2 - g''(0|x_0) c_p f(x_0) \frac{m^{(p+1)}(x_0)}{(p+1)!} h_1^{p+1} \Big\} \{1 + o(1)\}
= O\{ \alpha_n c (h_1^{p+1} + h_2^2) \};

Var(I_1) = n^{-1} \alpha_n^2 Var\big[ K_i \phi'_{h_2}\{ \epsilon_i + R(X_i) \} (\mu^T \tilde X_i) \big] \le n^{-1} \alpha_n^2 \mu^T \Big\{ \frac{ g(0|x_0) f(x_0) \bar\nu }{ h_1 h_2^3 } S^* \Big\} \mu \{1 + o(1)\} = O\{ \alpha_n^2 (n h_1 h_2^3)^{-1} c^2 \},   (A.3)

where \bar c_p = (\mu_0, \mu_1, ..., \mu_p)^T. Hence I_1 = O\{ \alpha_n c (h_1^{p+1} + h_2^2) \} + \alpha_n c \, O_p\{ (n h_1 h_2^3)^{-1/2} \} = O_p(c \alpha_n^2). If n h_1 h_2^5 \to \infty, similar to (A.3), we can prove

I_2 = n^{-1} \sum_{i=1}^n \tfrac{1}{2} K_i \phi''_{h_2}(\epsilon_i + R(X_i)) \alpha_n^2 \mu^T \tilde X_i \tilde X_i^T \mu = \tfrac{1}{2} \alpha_n^2 g''(0|x_0) f(x_0) \mu^T S \mu \{1 + o_p(1)\},
I_3 = -n^{-1} \sum_{i=1}^n \tfrac{1}{6} K_i \phi'''_{h_2}(z_i) \alpha_n^3 (\mu^T \tilde X_i)^3 = o_p(\alpha_n^2).   (A.4)
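The expansion of E\{\phi'_{h_2}(\epsilon + a_n(x)) \mid X = x\} above drives the entire bound on I_1, and it is easy to sanity-check numerically. The sketch below is our own check, not part of the proof: it draws \epsilon as a mode-centered Gamma(3, 1) error, so that g''(0) and g'''(0) are both nonzero, and compares a Monte Carlo estimate of E \phi'_{h_2}(\epsilon + a) with the leading-term approximation g''(0) a - g'''(0) h_2^2/2, using numerical derivatives of g; the two printed numbers should agree to leading order in h_2.

```python
import numpy as np
from scipy.stats import gamma

k = 3.0                                       # Gamma(k, 1) has its mode at k - 1
g = lambda t: gamma.pdf(t + (k - 1.0), a=k)   # density of eps = X - mode(X)

# numerical derivatives of g at 0 via central differences
d = 1e-3
g2 = (g(d) - 2 * g(0.0) + g(-d)) / d ** 2                          # g''(0) < 0
g3 = (g(2 * d) - 2 * g(d) + 2 * g(-d) - g(-2 * d)) / (2 * d ** 3)  # g'''(0)

h2, a = 0.3, 0.03                             # small shift, a = o(h2)
eps = gamma.rvs(a=k, size=5_000_000, random_state=0) - (k - 1.0)
t = eps + a
phi_prime = -(t / h2 ** 3) * np.exp(-0.5 * (t / h2) ** 2) / np.sqrt(2 * np.pi)

print(phi_prime.mean())               # Monte Carlo E{phi'_{h2}(eps + a)}
print(g2 * a - g3 * h2 ** 2 / 2.0)    # leading-term approximation
```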

Noticing that S is a positive definite matrix, \|\mu\| = c, and g''(0|x_0) < 0, we can choose c large enough such that I_2 dominates both I_1 and I_3 with probability at least 1 - \eta. Thus (A.1) holds. Therefore, with probability approaching 1 (wpa1), there exists a local maximizer \hat\theta^* such that \|\hat\theta^* - \theta^*\| \le \alpha_n c. Based on the definition of \theta^*, we obtain, wpa1,

h_1^v \{ \hat m_v(x_0) - m^{(v)}(x_0) \} = O_p\{ (n h_1 h_2^3)^{-1/2} + h_1^{p+1} + h_2^2 \}.

Define

W_n = \sum_{i=1}^n \tilde X_i K_i \phi'_{h_2}(\epsilon_i).   (A.5)

We have the following asymptotic representation.

Lemma A.1. Under conditions (A1)-(A5), it follows that

\hat\theta^* - \theta^* = h_1^{p+1} \frac{ m^{(p+1)}(x_0) }{ (p+1)! } S^{-1} c_p \{1 + o_p(1)\} + \frac{ S^{-1} W_n }{ n g''(0|x_0) f(x_0) } \{1 + o_p(1)\}.   (A.6)

Proof. Let

\hat\gamma_i = R(X_i) - \sum_{j=0}^p ( \hat\beta_j - \beta_j )(X_i - x_0)^j = R(X_i) - ( \hat\theta^* - \theta^* )^T \tilde X_i.

Then

y_i - \sum_{j=0}^p \hat\beta_j (X_i - x_0)^j = \epsilon_i + \hat\gamma_i.

The solution \hat\theta satisfies the equation

\sum_{i=1}^n \tilde X_i K_i \phi'_{h_2}(\epsilon_i + \hat\gamma_i) = \sum_{i=1}^n \tilde X_i K_i \big\{ \phi'_{h_2}(\epsilon_i) + \phi''_{h_2}(\epsilon_i) \hat\gamma_i + \tfrac{1}{2} \phi'''_{h_2}(\epsilon^*) \hat\gamma_i^2 \big\} = 0,   (A.7)

where \epsilon^* lies between \epsilon_i and \epsilon_i + \hat\gamma_i. Note that the second term on the left-hand side of

(A.7) is

\sum_{i=1}^n K_i \phi''_{h_2}(\epsilon_i) R(X_i) \tilde X_i - \sum_{i=1}^n K_i \phi''_{h_2}(\epsilon_i) \tilde X_i \tilde X_i^T ( \hat\theta^* - \theta^* ) \equiv J_1 + J_2.   (A.8)

From the proof of (A.3), we have

J_1 = n h_1^{p+1} g''(0|x_0) f(x_0) c_p \frac{ m^{(p+1)}(x_0) }{ (p+1)! } + o_p(n h_1^{p+1})

and

J_2 = -n g''(0|x_0) f(x_0) S \{1 + o_p(1)\} ( \hat\theta^* - \theta^* ).

From Theorem 2.1, we know \hat\theta^* - \theta^* = O_p\{ h_1^{p+1} + h_2^2 + (n h_1 h_2^3)^{-1/2} \}; hence, since m^{(p+1)}(x) is bounded,

\sup_{i: |X_i - x_0| \le h_1} |\hat\gamma_i| \le \sup_{i: |X_i - x_0| \le h_1} |R(X_i)| + \| \hat\theta^* - \theta^* \| = O_p( h_1^{p+1} + \| \hat\theta^* - \theta^* \| ) = O_p(\alpha_n).   (A.9)

Also, we have

E\big[ K_i \{ (X_i - x_0)^j / h_1^j \} \big] = \int \frac{1}{h_1} K\Big( \frac{x - x_0}{h_1} \Big) \Big( \frac{x - x_0}{h_1} \Big)^j f(x) \, dx = \mu_j f(x_0) + o(1).   (A.10)

Based on (A.4), (A.9), and \hat\theta^* - \theta^* = O_p(\alpha_n),

n^{-1} \sum_{i=1}^n K_i \hat\gamma_i^2 \, \tfrac{1}{2} \phi'''_{h_2}(\epsilon^*) (X_i - x_0)^j / h_1^j = o_p(\alpha_n).

Hence, for the third term on the left-hand side of (A.7),

\sum_{i=1}^n K_i \hat\gamma_i^2 \tilde X_i \phi'''_{h_2}(\epsilon^*) = o_p(n \alpha_n) = o_p(J_2).

Then, it follows from (A.5) and (A.7) that

\hat\theta^* - \theta^* = \{ n g''(0|x_0) f(x_0) S \}^{-1} n h_1^{p+1} g''(0|x_0) f(x_0) c_p \frac{ m^{(p+1)}(x_0) }{ (p+1)! } \{1 + o_p(1)\} + \{ n g''(0|x_0) f(x_0) S \}^{-1} W_n \{1 + o_p(1)\}
= h_1^{p+1} \frac{ m^{(p+1)}(x_0) }{ (p+1)! } S^{-1} c_p \{1 + o_p(1)\} + \frac{ S^{-1} W_n }{ n g''(0|x_0) f(x_0) } \{1 + o_p(1)\}.   (A.11)

Proof of Theorem 2.2. Based on (A.5) and condition (A5), we can easily get

E( n^{-1} W_n ) = E\{ \tilde X_i K_i \phi'_{h_2}(\epsilon_i) \} = -\frac{ g'''(0|x_0) }{ 2 } f(x_0) \bar c_p h_2^2 + o_p(h_2^2).

Similar to the proof of (A.3), we have

E\big\{ K_i^2 \phi'_{h_2}(\epsilon_i)^2 (X_i - x_0)^j / h_1^j \big\} = (h_1 h_2^3)^{-1} \nu_j \bar\nu g(0|x_0) f(x_0) \{1 + o(1)\},

where \bar\nu = \int \phi^2(t) t^2 \, dt. So

Cov( W_n / n ) = (n h_1 h_2^3)^{-1} \bar\nu g(0|x_0) f(x_0) S^* \{1 + o(1)\}.   (A.12)

Based on the result (A.6), the asymptotic bias b_v(x_0) and variance of \hat m_v(x_0) are naturally given by

b_v(x_0) = e_{v+1}^T S^{-1} \Big\{ h_1^{p+1-v} \frac{ v! }{ (p+1)! } m^{(p+1)}(x_0) c_p - \frac{ g'''(0|x_0) \, v! \, h_2^2 }{ 2 g''(0|x_0) h_1^v } \bar c_p \Big\} \{1 + o(1)\}

and

Var\{ \hat m_v(x_0) \} = \frac{ (v!)^2 g(0|x_0) \bar\nu }{ n h_2^3 h_1^{1+2v} f(x_0) g''(0|x_0)^2 } e_{v+1}^T S^{-1} S^* S^{-1} e_{v+1} \{1 + o(1)\}.

Noting that \mu_j = 0 for odd j, a simple calculation shows that the (v+1)-th element of S^{-1} c_p is zero for p - v even, so we need a higher-order expansion of the asymptotic bias for p - v even. Following arguments similar to those in the proof of Theorem 2.1, if n h_1^3 h_2^5 \to \infty (which makes the root of the variance order smaller than the bias order), we can prove

J_1 = n h_1^{p+1} \Big[ \Gamma(x_0) c_p \frac{ m^{(p+1)}(x_0) }{ (p+1)! } + h_1 \tilde c_p \Big\{ \Gamma'(x_0) \frac{ m^{(p+1)}(x_0) }{ (p+1)! } + \Gamma(x_0) \frac{ m^{(p+2)}(x_0) }{ (p+2)! } \Big\} \Big] \{1 + o_p(1)\},

J_2 = -n \big\{ \Gamma(x_0) S + h_1 \tilde S \Gamma'(x_0) \big\} \{1 + o_p(1)\} ( \hat\theta^* - \theta^* ),

where J_1 and J_2 are defined in (A.8) and \Gamma(x) = g''(0|x) f(x). Then, it follows from (A.7) that

\hat\theta^* - \theta^* = h_1^{p+1} \big\{ \Gamma(x_0) S + h_1 \tilde S \Gamma'(x_0) \big\}^{-1} \Big[ \Gamma(x_0) c_p \frac{ m^{(p+1)}(x_0) }{ (p+1)! } + h_1 \tilde c_p \Big\{ \Gamma'(x_0) \frac{ m^{(p+1)}(x_0) }{ (p+1)! } + \Gamma(x_0) \frac{ m^{(p+2)}(x_0) }{ (p+2)! } \Big\} \Big] \{1 + o_p(1)\} + \frac{ S^{-1} W_n }{ n g''(0|x_0) f(x_0) } \{1 + o_p(1)\}
= h_1^{p+1} \Big\{ \frac{ m^{(p+1)}(x_0) }{ (p+1)! } S^{-1} c_p + h_1 b^*(x_0) \Big\} \{1 + o_p(1)\} + \frac{ S^{-1} W_n }{ n \Gamma(x_0) } \{1 + o_p(1)\},

where

b^*(x_0) = \Gamma^{-1}(x_0) S^{-1} \tilde c_p \Big\{ \Gamma'(x_0) \frac{ m^{(p+1)}(x_0) }{ (p+1)! } + \Gamma(x_0) \frac{ m^{(p+2)}(x_0) }{ (p+2)! } \Big\} - \Gamma^{-1}(x_0) \Gamma'(x_0) \frac{ m^{(p+1)}(x_0) }{ (p+1)! } S^{-1} \tilde S S^{-1} c_p.

For p - v even, since the (v+1)-th elements of S^{-1} c_p and S^{-1} \tilde S S^{-1} c_p are zero, the asymptotic bias b_v(x_0) of \hat m_v(x_0) is naturally given by

b_v(x_0) = e_{v+1}^T S^{-1} \Big[ h_1^{p+2-v} \frac{ v! }{ (p+2)! } \tilde c_p \Big\{ m^{(p+2)}(x_0) + (p+2) m^{(p+1)}(x_0) \frac{ \Gamma'(x_0) }{ \Gamma(x_0) } \Big\} - \frac{ g'''(0|x_0) \, v! \, h_2^2 }{ 2 g''(0|x_0) h_1^v } \bar c_p \Big] \{1 + o(1)\}.

Proof of Theorem 2.3. It is sufficient to show that

W_n^* \equiv \sqrt{ h_1 h_2^3 / n } \, W_n \xrightarrow{D} N(0, D),   (A.13)

where D = \bar\nu g(0|x_0) f(x_0) S^*, because, using Slutsky's theorem, it follows from (A.6), (A.13), and Theorem 2.2 that

\frac{ \hat m_v(x_0) - m^{(v)}(x_0) - b_v(x_0) }{ \sqrt{ Var\{ \hat m_v(x_0) \} } } \xrightarrow{D} N(0, 1).

Next we show (A.13). For any unit vector d \in R^{p+1}, we prove

\{ d^T Cov(W_n^*) d \}^{-1/2} \{ d^T W_n^* - d^T E(W_n^*) \} \xrightarrow{D} N(0, 1).

Let \xi_i = \sqrt{ h_1 h_2^3 / n } \, K_i \phi'_{h_2}(\epsilon_i) d^T \tilde X_i. Then d^T W_n^* = \sum_{i=1}^n \xi_i. We check Lyapunov's condition. Based on (A.12), we can get

Cov(W_n^*) = \bar\nu g(0|x_0) f(x_0) S^* \{1 + o(1)\}

and

Var( d^T W_n^* ) = d^T Cov(W_n^*) d = \bar\nu g(0|x_0) f(x_0) d^T S^* d \{1 + o(1)\}.

So we only need to prove n E|\xi_1|^3 \to 0. Noticing that (d^T \tilde X_1)^2 \le \|d\|^2 \|\tilde X_1\|^2, \phi(\cdot) is bounded, and K(\cdot) has compact support,

n E|\xi_1|^3 \le O( n \cdot n^{-3/2} h_1^{3/2} h_2^{9/2} ) \sum_{j=0}^p E\Big\{ K_1^3 \, | \phi'_{h_2}(\epsilon_1) |^3 \, \Big| \frac{ X_1 - x_0 }{ h_1 } \Big|^{3j} \Big\} = O( n^{-1/2} h_1^{3/2} h_2^{9/2} ) \sum_{j=0}^p O( h_1^{-2} h_2^{-5} ) = O\{ (n h_1 h_2)^{-1/2} \} \to 0.

So the asymptotic normality for W_n^* holds with covariance matrix \bar\nu g(0|x_0) f(x_0) S^*.

References

Chatterjee, S. and Price, B. (1977). Regression Analysis by Example. John Wiley and Sons, New York.

Chaudhuri, P. and Marron, J. S. (1999). SiZer for Exploration of Structures in Curves. Journal of the American Statistical Association, 94.

Davies, P. L. and Kovac, A. (2004). Densities, Spectral Densities and Modality. Annals of Statistics, 32.

Donoho, D. L. (1982). Breakdown Properties of Multivariate Location Estimators. Ph.D. qualifying paper, Department of Statistics, Harvard University.

Donoho, D. L. and Huber, P. J. (1983). The Notion of Breakdown Point. In: Doksum, K. and Hodges, J. L. (editors), A Festschrift in Honor of Erich Lehmann. Wadsworth, Belmont, CA.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.

Fan, J. and Jiang, J. (2000). Variable Bandwidth and One-Step Local M-Estimator. Science in China, 43.

Fisher, N. I. and Marron, J. S. (2001). Mode Testing via the Excess Mass Estimate. Biometrika, 88.

Friedman, J. H. and Fisher, N. I. (1999). Bump Hunting in High-Dimensional Data. Statistics and Computing, 9.

Hall, P., Minnotte, M. C., and Zhang, C. (2004). Bump Hunting with Non-Gaussian Kernels. Annals of Statistics, 32.

Hampel, F. R. (1971). A General Qualitative Definition of Robustness. Annals of Mathematical Statistics, 42.

Hampel, F. R. (1974). The Influence Curve and Its Role in Robust Estimation. Journal of the American Statistical Association, 69.

Hampel, F. R. (1986). Robust Statistics. Wiley, New York.

Huber, P. J. (1981). Robust Statistics. Wiley, New York.

Li, J., Ray, S., and Lindsay, B. G. (2007). A Nonparametric Statistical Approach to Clustering via Mode Identification. Journal of Machine Learning Research, 8.

Muller, D. W. and Sawitzki, G. (1991). Excess Mass Estimates and Tests for Multimodality. Journal of the American Statistical Association, 86.

Nadaraya, E. A. (1964). On Estimating Regression. Theory of Probability and Its Applications, 9.

Parzen, E. (1962). On Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics, 33.

Ray, S. and Lindsay, B. G. (2005). The Topography of Multivariate Normal Mixtures. Annals of Statistics, 33.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York.

Silverman, B. W. (1981). Using Kernel Density Estimates to Investigate Multimodality. Journal of the Royal Statistical Society, Series B, 43.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Watson, G. S. (1964). Smooth Regression Analysis. Sankhya, Series A, 26.

Yao, W. and Lindsay, B. G. (2009). Bayesian Mixture Labelling by Highest Posterior Density. Journal of the American Statistical Association, 104.
