A New Regression Model: Modal Linear Regression
Weixin Yao and Longhai Li

Abstract: The population mode has long been regarded as an important feature of data, and it is traditionally estimated with a nonparametric kernel density estimator. However, little research has applied the mode concept when a regression structure is imposed. This article develops a new data analysis tool, called modal linear regression, for exploring high-dimensional data. Modal linear regression assumes that the mode of the conditional density of Y given predictors x is a linear function of x. It is distinguished from conventional linear regression in that it measures the center by the most likely conditional value rather than the conditional average. We will argue that modal regression, besides its robustness, gives a more meaningful point prediction and a larger coverage probability for prediction than mean regression when the error density is skewed, provided short intervals of the same length, centered around each estimate, are used. We propose an EM-type algorithm for the proposed modal linear regression and establish its asymptotic properties systematically under general error density assumptions, including skewed error densities. Applications to simulated and real data demonstrate that the proposed modal regression has better predictive performance than mean regression.

Weixin Yao is Assistant Professor, Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. (wxyao@ksu.edu). Longhai Li is Assistant Professor, Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, Saskatchewan, S7N5E6, Canada (longhai@math.usask.ca). The research of Longhai Li is supported by funding from the Natural Sciences and Engineering Research Council of Canada and the Canadian Foundation of Innovations.
Key Words: Forest fires data; Linear regression; Modal regression; Mode.

1. Introduction

The mode of a distribution is regarded as an important feature of data. Several authors have made efforts to identify the modes of population distributions for low-dimensional data; see, for example, Muller and Sawitzki (1991); Scott (1992); Friedman and Fisher (1999); Chaudhuri and Marron (1999); Fisher and Marron (2001); Davies and Kovac (2004); Hall, Minnotte, and Zhang (2004); Ray and Lindsay (2005); Yao and Lindsay (2009). (See the command npconmode in the R package np for nonparametric mode estimation.) For high-dimensional data, it is common to impose some model structure, such as an assumption on the conditional distributions. Thus, it is of great interest to study mode hunting for conditional distributions.

Given a random sample {(x_i, y_i), i = 1, ..., n}, where x_i is a p-dimensional column vector, suppose f(y|x) is the conditional density function of Y given x. Conventional regression models use the mean of f(y|x) to investigate the relationship between Y and x, and linear regression assumes that the mean of f(y|x) is a linear function of x. In this article, we propose a new regression model, called modal linear regression, which assumes that the mode of f(y|x) is a linear function of the predictor x. Modal linear regression measures the center by the most likely conditional value rather than the conditional average used by traditional linear regression. Compared to traditional mean regression, the proposed modal regression has the following main advantages.

1. When the error density is highly skewed, modal regression provides a more meaningful location estimator and point prediction than mean regression.

2. Since modal regression focuses on the highest conditional density region, a short interval around the modal regression estimate is expected to have a larger coverage probability for prediction than an interval of the same length around the mean regression estimate.

3. Modal regression can be used to estimate the relationship between variables for the majority of the data when a small proportion of the data are contaminated or do not follow the same relationship as the majority.

4. Due to the robustness of the mode, modal linear regression is also robust to outliers and heavy-tailed error densities.

5. Modal regression has several advantages over traditional robust regression. Traditional robust regression still targets the mean regression function and requires a symmetric error density; it does not have a good model interpretation if the error density is asymmetric. Modal regression, in contrast, does not require the error density to be symmetric and retains its modal interpretation for general error densities.

Therefore, modal regression provides another powerful tool for exploring high-dimensional data, and it can easily be extended to other regression models, such as nonlinear regression, nonparametric regression, and varying coefficient partially linear regression. Quantile regression is another alternative data analysis tool when the data are skewed. However, quantile regression cannot reveal the modal information and might produce low-density point predictions. Modal regression is therefore a potentially very useful addition to the current set of data analysis tools.

We propose a kernel-based objective function to estimate the modal regression line and develop a modal EM algorithm to maximize the proposed objective function. The convergence rate and sampling properties of the proposed estimation procedure are systematically studied. A Monte Carlo simulation study is also conducted to assess the finite-sample performance of the proposed procedure. The
application of the proposed modal regression to predicting forest fires from meteorological data shows that the proposed method has better predictive performance. We also propose a method for modal regression that uses the information in the error density to construct skewed prediction intervals. Leave-one-out cross validation shows that modal regression provides much shorter prediction intervals than mean regression for the forest fires data.

The rest of this article is organized as follows. In Section 2, we introduce the new modal linear regression, develop the modal EM algorithm, and study the asymptotic properties of the resulting estimate. In Section 3, a Monte Carlo simulation study is conducted, and a real data application is used to demonstrate the proposed methodology. Some discussion is given in Section 4. Proofs of the consistency of our estimators and the necessary technical conditions are given in the Appendix.

2. Modal linear regression

2.1. Introduction to modal linear regression

Suppose a response variable Y given a set of predictors x is distributed with conditional density f(y|x). In this article, we propose to use the mode of f(y|x), denoted by Mode(Y|x) = arg max_y f(y|x), to investigate the relationship between Y and x. The proposed modal linear regression method assumes that

Mode(Y|x) = x^T β. (2.1)

We will discuss the extension of modal linear regression (2.1) to other models in Section 4. To include an intercept term in (2.1), we assume that the first element of x is 1. Let ε = y − x^T β and denote by g(ε|x) the conditional density of ε given x; we allow this conditional density to depend on x. Under the model assumption (2.1), g(ε|x) is maximized at 0 for any x. If g(ε|x) is symmetric about 0,
the β in (2.1) will be the same as the conventional linear regression parameter. However, if g(ε|x) is skewed, the two will differ, and it is even possible that the modal regression is a linear function of x while the conventional mean regression function is nonlinear. Let us use an example to illustrate the difference between the modal regression function and the conventional mean regression function when the error is skewed.

Example 1: Let (x, Y) satisfy the model

Y = m(x) + σ(x)ε, (2.2)

where ε has density h(·). Suppose h(·) is a skewed density with mean 0 and mode 1.

a) If m(x) = x^T β and σ(x) = x^T α, then E(Y|x) = x^T β and Mode(Y|x) = x^T(β + α), where Mode(Y|x) denotes the conditional mode of Y given x. Therefore, Y depends on x linearly from the point of view of both mean regression and modal regression, although the regression parameters differ.

b) If m(x) = 0 and σ(x) = x^T α, then E(Y|x) = 0 and Mode(Y|x) = x^T α. Therefore, from the point of view of conventional mean regression, Y does not depend on x, but based on modal regression, Y does depend on x through the modal line.

From this example, we can see that variable selection techniques based on modal regression might reveal useful new predictors compared to mean regression.
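A quick numerical check of case a), using an error density of our own choosing: ε = 2 − Z with Z ~ Gamma(shape 2, scale 1), which has mean 0 and mode 1 as Example 1 requires (the covariate value and the coefficients β, α below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed error: eps = 2 - Z with Z ~ Gamma(shape=2, scale=1),
# so E(eps) = 2 - 2 = 0 and Mode(eps) = 2 - (shape - 1) * scale = 1.
n = 200_000
eps = 2.0 - rng.gamma(shape=2.0, scale=1.0, size=n)

# Case a) of Example 1 at a fixed covariate x = (1, 0.5), with
# illustrative beta = (1, 3) and alpha = (0.5, 1):
beta, alpha = np.array([1.0, 3.0]), np.array([0.5, 1.0])
x = np.array([1.0, 0.5])
y = x @ beta + (x @ alpha) * eps      # x^T beta = 2.5, x^T alpha = 1.0

cond_mean = y.mean()                  # targets E(Y|x) = x^T beta = 2.5

# Conditional mode via a smoothed histogram peak;
# targets Mode(Y|x) = x^T (beta + alpha) = 3.5.
hist, edges = np.histogram(y, bins=500, range=(0.0, 5.0))
smooth = np.convolve(hist, np.ones(15) / 15, mode="same")
centers = 0.5 * (edges[:-1] + edges[1:])
cond_mode = centers[np.argmax(smooth)]
print(cond_mean, cond_mode)
```

The sample mean settles near x^T β while the density peak sits near x^T(β + α), illustrating how the two regression targets separate under a skewed error.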
We propose to estimate the modal regression parameter β in (2.1) by maximizing

Q_h(β) ≡ (1/n) Σ_{i=1}^n φ_h(y_i − x_i^T β), (2.3)

where φ_h(t) = h^{−1} φ(t/h) and φ(t) is a kernel density function. Denote by β̂ the maximizer of (2.3). We call β̂ the modal linear regression (MODLR) estimator. For ease of computation, we use the standard normal density function for φ(t) throughout this paper; see (2.6) below.

To see why (2.3) can be used to estimate the modal regression, take β = β_0, the intercept term only; then (2.3) simplifies to

Q_h(β_0) ≡ (1/n) Σ_{i=1}^n φ_h(y_i − β_0). (2.4)

As a function of β_0, Q_h(·) is the kernel estimate of the density function of y. Therefore, the maximizer of (2.4) is the mode of the kernel density estimate based on y_1, ..., y_n. When n → ∞ and h → 0, the mode of the kernel density estimate converges to the mode of the distribution of y. Such a modal estimator was proposed by Parzen (1962). Here, we use this kernel-type objective function to estimate the modal regression parameter β. We will prove, in Section 2.3, that the β̂ found by maximizing (2.3) is a consistent estimate of the modal regression parameter in (2.1) under very general error densities, including skewed ones, provided h → 0 as n → ∞.

Note that for any fixed β, Q_h(β) in (2.3) is the value at 0 of the kernel density estimate of the residuals ε_i = y_i − x_i^T β. Therefore, by maximizing (2.3) with respect to β, we find a line x^T β̂ such that the value at 0 of the kernel density estimate of the residuals is maximized. If the uniform kernel φ_h(t) = (2h)^{−1} I(|t| ≤ h) is used for φ(·), then (2.3) seeks the line x^T β̂ such that the band x^T β̂ ± h contains the largest number of response values y_i. Therefore, modal regression gives more meaningful prediction than the mean
regression when the data are skewed. Lee (1989) used such a uniform kernel to estimate the modal regression. However, in Lee (1989), h is fixed and does not depend on the sample size n, so a symmetric error is required to obtain a consistent modal line (note that in that case the modal line coincides with the traditional mean regression line). Thus, that modal regression estimator is essentially a robust regression estimate under the assumption of a symmetric error density.

2.2. Modal expectation-maximization algorithm

Note that (2.3) does not have an explicit solution. In this section, we extend the modal expectation-maximization (MEM) algorithm, proposed by Li, Ray, and Lindsay (2007), to maximize (2.3). Similar to an EM algorithm, the MEM algorithm consists of two steps: an E-step and an M-step. Let β^(0) be the initial value. Starting with k = 0, repeat the following two steps until convergence.

E-step: Calculate the weights π(j | β^(k)), j = 1, ..., n, as

π(j | β^(k)) = φ_h(y_j − x_j^T β^(k)) / Σ_{i=1}^n φ_h(y_i − x_i^T β^(k)) ∝ φ_h(y_j − x_j^T β^(k)). (2.5)

M-step: Update β^(k+1) by

β^(k+1) = arg max_β Σ_{j=1}^n π(j | β^(k)) log φ_h(y_j − x_j^T β) = (X^T W_k X)^{−1} X^T W_k y, (2.6)

where X = (x_1, ..., x_n)^T, W_k is the n × n diagonal matrix with diagonal elements π(j | β^(k)), and y = (y_1, ..., y_n)^T.

Note that the function optimized in the M-step is just a weighted sum of log-likelihoods of ordinary linear regression, so an explicit maximizer is available. The following
theorem establishes the ascending property of the proposed MEM algorithm; its proof is given in the Appendix.

Theorem 2.1. Each iteration of (2.5) and (2.6) monotonically non-decreases the objective function (2.3), i.e., Q_h(β^(k+1)) ≥ Q_h(β^(k)) for all k.

Note that the converged value may depend on the starting point, as with usual EM algorithms, and there is no guarantee that the algorithm converges to the global maximizer of (2.3). Therefore, it is prudent to run the algorithm from several starting points and choose the best local optimum found.

From the MEM algorithm, it can be seen that the major difference between the mean regression estimate by least squares (LSE) and the modal regression estimate (MODLR) lies in the E-step. For the LSE, each observation has equal weight. For MODLR, on the other hand, the weight depends on how close y_i is to the modal regression line. This weighting scheme allows MODLR to reduce the effect of observations far from the modal regression line and thus achieve robustness.

The iteratively reweighted least squares (IRWLS) algorithm is commonly used for general M-estimators. Since our modal regression can be considered a special case of M-estimation, the IRWLS algorithm can also be applied to it. When the normal kernel is used for φ(·), the IRWLS algorithm is equivalent to the proposed MEM algorithm, but the two differ when φ(·) is not normal. IRWLS has been proved to have the monotone and convergence property if φ(x)/x is nonincreasing, whereas the proposed MEM algorithm has the monotone property for any kernel density φ(·). In addition, note that φ(x)/x is not nonincreasing if φ(x) is a normal density function. Therefore, the proposed MEM algorithm provides a better explanation of why IRWLS is monotone for the normal kernel density.

2.3. Theoretical properties
Note that the asymptotic properties established for traditional M-estimators require the error density to be symmetric and the objective function to be fixed, since traditional M-estimators still target mean regression. For our proposed modal regression (2.3), the tuning parameter h in the objective function goes to zero, and the error density can be skewed. Therefore, the theoretical results on traditional M-estimators cannot be applied directly to modal regression. In this section, we prove the consistency of the proposed modal regression estimator β̂ for model (2.1) and provide its convergence rate as well as its asymptotic distribution. The proofs are given in the Appendix.

Theorem 2.2. When h → 0 and nh^5 → ∞, under the regularity conditions (A1)-(A3) in the Appendix, there exists a consistent maximizer of (2.3) such that

‖β̂ − β_0‖ = O_p{h^2 + (nh^3)^{−1/2}},

where β_0 is the true coefficient of the modal regression line defined in (2.1).

Theorem 2.3. Under the same assumptions as Theorem 2.2, the β̂ given in Theorem 2.2 has the following asymptotic normality:

√(nh^3) [β̂ − β_0 − (h^2/2) J^{−1} K {1 + o_p(1)}] →_D N{0, ν_2 J^{−1} L J^{−1}}, (2.7)

where ν_2 = ∫ t^2 φ^2(t) dt and

J = E{g''(0|x_i) x_i x_i^T}, K = E{g'''(0|x_i) x_i}, L = E{g(0|x_i) x_i x_i^T}.

Parzen (1962) and Eddy (1980) proved similar asymptotic results for kernel estimators of the mode of the distribution of y without considering a regression effect. Therefore,
the results of Parzen (1962) and Eddy (1980) can be considered special cases of Theorem 2.3 with no predictor involved, i.e., x = 1.

From Theorem 2.3, we can see that the asymptotic bias of β̂ is h^2 J^{−1} K / 2 and the asymptotic variance is ν_2 J^{−1} L J^{−1} / (nh^3). A theoretical optimal bandwidth h for estimating β can be obtained by minimizing the asymptotic weighted mean squared error (MSE)

E{(β̂ − β_0)^T W (β̂ − β_0)} ≈ K^T J^{−1} W J^{−1} K h^4 / 4 + (nh^3)^{−1} ν_2 tr(J^{−1} L J^{−1} W),

where W is a weight matrix, such as the identity matrix, reflecting which coefficients are more important in inference, and tr(A) is the trace of A. Therefore, the asymptotically optimal bandwidth is

ĥ_opt = [3 ν_2 tr(J^{−1} L J^{−1} W) / (K^T J^{−1} W J^{−1} K)]^{1/7} n^{−1/7}. (2.8)

If W = (J^{−1} L J^{−1})^{−1} = J L^{−1} J, which is proportional to the inverse of the asymptotic variance of β̂, then

ĥ_opt = [3 ν_2 (p + 1) / (K^T L^{−1} K)]^{1/7} n^{−1/7}. (2.9)

Let β = (β_0, β_s^T)^T, where β_0 is the scalar intercept parameter and β_s is the slope parameter. If ε is independent of x, then the asymptotic bias h^2 J^{−1} K / 2 = (1, 0, ..., 0)^T g'''(0) h^2 / {2 g''(0)}, and thus the asymptotic bias of the slope parameter β_s is 0. Therefore, the optimal bandwidth h for estimating β_s should go to infinity, which implies that the resulting estimate β̂_s is a least squares estimate with root-n consistency. This is expected, since when ε is independent of x, the slope parameter β_s of the modal regression line is the same as the slope parameter of the conventional mean regression line and thus can be estimated at a root-n convergence rate.
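For the standard normal kernel, ν_2 = ∫ t^2 φ^2(t) dt has the closed form 1/(4√π); the sketch below checks this numerically and evaluates the plug-in bandwidth (2.9) for hypothetical values of K̂ and L̂ (the numbers fed to h_opt are illustrative, not estimates from any data set in the paper):

```python
import math

def phi(t):
    # Standard normal kernel.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

# nu2 = integral of t^2 phi(t)^2 dt; closed form 1/(4 sqrt(pi)) since
# phi(t)^2 = exp(-t^2)/(2 pi) and the integral of t^2 exp(-t^2) is sqrt(pi)/2.
m, a, b = 40_000, -8.0, 8.0
dt = (b - a) / m
nu2_num = sum((a + i * dt) ** 2 * phi(a + i * dt) ** 2 for i in range(m + 1)) * dt
nu2 = 1.0 / (4.0 * math.sqrt(math.pi))

def h_opt(nu2, p, K, L_inv, n):
    # (2.9): h_opt = [3 * nu2 * (p + 1) / (K^T L^{-1} K)]^{1/7} * n^{-1/7}.
    quad_form = sum(K[i] * sum(L_inv[i][j] * K[j] for j in range(len(K)))
                    for i in range(len(K)))
    return (3.0 * nu2 * (p + 1) / quad_form) ** (1.0 / 7.0) * n ** (-1.0 / 7.0)

# Hypothetical K-hat and L-hat-inverse for p = 1 and n = 200.
h = h_opt(nu2, 1, [0.5, 0.2], [[1.0, 0.0], [0.0, 1.0]], 200)
print(nu2_num, nu2, h)
```

The n^{−1/7} rate is visibly slower than the usual n^{−1/5} density-estimation rate, reflecting the extra derivative of g entering the bias and variance.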
Given the root-n consistent estimate β̂_s (using the LSE, for example), we propose to further estimate β_0 by

β̂_0 = arg max_{β_0} (1/n) Σ_{i=1}^n φ_h(y_i − x_i^T β̂_s − β_0). (2.10)

This maximization can be carried out with the MEM algorithm proposed in Section 2.2. Denoting by β̂_0 the maximizer of (2.10), we have the following result, whose proof is given in the Appendix.

Theorem 2.4. Under the same assumptions as Theorem 2.2, if ε is independent of x and g'''(0) ≠ 0, the β̂_0 defined in (2.10) has the following asymptotic distribution:

√(nh^3) {β̂_0 − β_0 − g'''(0) h^2 / (2 g''(0)) + o_p(h^2)} →_D N{0, g(0) ν_2 / [g''(0)]^2}. (2.11)

Note that when ε is independent of x, J^{−1} L J^{−1} = {g''(0)}^{−2} g(0) E(xx^T)^{−1}. Let

A = E(xx^T) = ( 1, A_12; A_12^T, A_22 )

and let a^{11} be the (1,1) element of A^{−1}. Noting that a^{11} = (1 − A_12 A_22^{−1} A_12^T)^{−1} and that A_22 is positive definite, we have a^{11} ≥ 1. Therefore, based on Theorems 2.3 and 2.4, we can see that using the root-n consistent estimate β̂_s as the initial slope, we obtain a more efficient estimate of the intercept β_0 than the one found by maximizing (2.3) directly. This is reasonable, since the estimate β̂_0 in (2.10) need not account for the uncertainty of β̂_s, thanks to its root-n consistency, and thus β̂_0 is asymptotically as efficient as if β_s were known.

From Theorem 2.4, we can see that the asymptotic bias of β̂_0 is {2 g''(0)}^{−1} g'''(0) h^2 and its asymptotic variance is [{g''(0)}^2 nh^3]^{−1} g(0) ν_2. By minimizing the asymptotic MSE, we
can get the asymptotically optimal bandwidth h for estimating β_0:

ĥ_opt = [3 g(0) ν_2 / {g'''(0)}^2]^{1/7} n^{−1/7}. (2.12)

2.4. Finite sample breakdown point

To better investigate how robust the MODLR estimator is, we also calculate its finite sample breakdown point. The breakdown point quantifies the proportion of bad data in a sample that an estimator can tolerate before returning arbitrary values. Since the breakdown point is usually most useful in a small-sample setup (Donoho, 1982; Donoho and Huber, 1983), we focus on the finite sample breakdown point. A number of definitions for the finite sample breakdown point have been proposed (see, for example, Hampel, 1971, 1974; Donoho, 1982; Donoho and Huber, 1983). In this paper, we work with the finite sample contamination breakdown point.

Let z_i = (x_i, y_i). Given the sample Z = (z_1, ..., z_n), denote by T(Z) the MODLR estimate of β, as defined in (2.3). We can corrupt the original sample Z by adding m arbitrary points Z' = (z_{n+1}, ..., z_{n+m}). The corrupted sample Z ∪ Z' then has sample size n + m and contains a fraction δ = m/(m + n) of bad values. The finite sample contamination breakdown point δ* is defined as

δ*(Z, T) = min_{1 ≤ m ≤ n} { m/(n + m) : sup_{Z'} ‖T(Z ∪ Z')‖ = ∞ }, (2.13)

where ‖·‖ is the Euclidean norm.

Theorem 2.5. Given the observations Z = (z_1, ..., z_n), suppose T(Z) = β̂ is the MODLR estimate defined in (2.3). Let

M = √(2π) h Σ_{i=1}^n φ_h(y_i − x_i^T β̂). (2.14)
Then the finite sample contamination breakdown point of MODLR is

δ*(Z, T) = m* / (n + m*), (2.15)

where m* is an integer satisfying ⌊M⌋ ≤ m* ≤ ⌈M⌉ + 1, ⌊a⌋ is the largest integer not greater than a, and ⌈a⌉ is the smallest integer not less than a.

The proof of Theorem 2.5 is given in the Appendix. From the above theorem, we can see that the breakdown point depends not only on φ(·) and the tuning parameter h, but also on the sample configuration. (However, Huber (1984) pointed out that if the scale (contained in the bandwidth h of the MODLR estimate) is determined from the sample itself, the breakdown point is empirically quite high.)

3. Simulation Study and Application

In this section, we conduct a Monte Carlo simulation to assess the finite-sample performance of the proposed MODLR estimator and compare it with some other regression methods. A real data application is also provided.

To use the modal regression estimator, we first need to select the bandwidth. Note that the asymptotically optimal bandwidth formula (2.9) contains the unknown quantities g^{(v)}(0|x), v = 0, 2, 3, the vth derivatives of the conditional density of ε given x; hence it is not ready to use. A commonly used remedy is to replace the unknown quantities by estimates. Given the initial residuals ε̂_i = y_i − x_i^T β̂, where β̂ is the traditional least squares estimate (or a robust estimate if there are outliers) of β, we can estimate their mode, denoted by m̂, by maximizing the kernel density estimator (Parzen, 1962). Under the assumption of independence of ε and x, ε̂_i − m̂ approximately has density g(·), and thus
g^{(v)}(0|x) can be estimated by (see, for example, Silverman, 1986, and Scott, 1992)

ĝ^{(v)}(0|x) = (1/(n h^{v+1})) Σ_{i=1}^n K^{(v)}{(ε̂_i − m̂)/h}, v = 0, 2, 3,

where K^{(v)}(·) is the vth derivative of the kernel density function K(·). Then we can estimate J, K, and L by Ĵ = n^{−1} Σ_i ĝ''(0|x_i) x_i x_i^T, K̂ = n^{−1} Σ_i ĝ'''(0|x_i) x_i, and L̂ = n^{−1} Σ_i ĝ(0|x_i) x_i x_i^T, and apply formula (2.9) to estimate ĥ_opt. To refine the bandwidth selection, one might further iteratively update the chosen bandwidth by recalculating the residuals ε̂_i based on the modal linear regression estimate.

3.1. A Monte Carlo simulation

Example 1: Generate an iid sample {(x_i, y_i), i = 1, ..., n} from the model

Y = 1 + 3X + σ(X)ε,

where X ~ U(0, 1), ε ~ 0.5 N(−1, σ_1^2) + 0.5 N(1, σ_2^2), and σ(X) = 1 + 2X. Note that E(ε) = 0, Mode(ε) = 1, and Median(ε) = 0.67 (the last two quantities are approximate). The traditional mean regression function is E(Y|X) = 1 + 3X. The proposed modal regression function is Mode(Y|X) = 2 + 5X, with modal regression residual Y − Mode(Y|X) = (1 + 2X)(ε − 1), whose distribution peaks at 0 but is negatively skewed. The median regression function is Median(Y|X) ≈ 1.67 + 4.34X.

We consider and compare the following four methods: 1) traditional mean regression based on the least squares estimate (LSE); 2) median regression (MEDREG); 3) the MM-estimate (an M-estimate with an initial robust M-estimate of scale; Yohai, 1987) based on Tukey's biweight ψ-function; 4) the proposed modal linear regression (MODLR).
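The MODLR fits in this comparison can be computed with the MEM algorithm of Section 2.2; a minimal sketch with the normal kernel is given below. The mixture parameters of the error and the fixed bandwidth h are our own illustrative choices, not the paper's data-driven plug-in rule:

```python
import numpy as np

def modlr_mem(X, y, h, max_iter=300, tol=1e-10):
    """Modal linear regression via the MEM algorithm with a normal kernel.
    E-step: weights pi(j | beta) proportional to phi_h(y_j - x_j^T beta).
    M-step: weighted least squares, beta = (X^T W X)^{-1} X^T W y."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # LSE starting value
    for _ in range(max_iter):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / h) ** 2)           # unnormalized kernel weights
        w /= w.sum()
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy data whose modal line is roughly 2 + 5x (illustrative mixture error).
rng = np.random.default_rng(1)
n = 3000
x = rng.uniform(0, 1, n)
eps = np.where(rng.uniform(size=n) < 0.5,
               rng.normal(-1.0, 2.5, n), rng.normal(1.0, 0.5, n))
y = 1 + 3 * x + (1 + 2 * x) * eps
X = np.column_stack([np.ones(n), x])
beta_hat = modlr_mem(X, y, h=0.5)
print(beta_hat)   # should land closer to the modal line (2, 5) than to (1, 3)
```

In practice the algorithm would be restarted from several initial values, as recommended after Theorem 2.1, since the objective can have multiple local maxima.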
Figure 1 is a scatter plot of a typical generated sample with n = 200, together with the four fitted regression lines. From the plot, we can see that the modal line goes through the region containing the greatest number of points; a small prediction band around this line is expected to contain the greatest number of future points. In contrast, the mean regression line based on the LSE is pulled toward a flatter line and lies in a much lower-density area in order to capture the conditional mean. The regression lines based on median regression and the MM-estimate lie in higher-density areas than the line based on the LSE.

Table 1 reports the average and standard error (Std) of the parameter estimates for each method based on 1,000 replicates. From the table, we can see that LSE, MEDREG, and MODLR estimate their target parameters well. However, the MM-estimate cannot estimate the mean regression well, since the assumption of a symmetric error density is violated. Surprisingly, MODLR has a smaller standard error than the other methods in this example with skewed error, especially when n = 200 or n = 400. Therefore, in finite samples, the MODLR estimator not only has a good modal interpretation but may also have better estimation accuracy than other methods when the error is skewed. In addition, MEDREG and the MM-estimate also have smaller standard errors than LSE, due to their robustness.

To see why modal linear regression has better predictive performance than traditional mean regression, Table 2 reports the average and standard error of the estimated coverage probabilities of prediction intervals of the same length centered around each estimate. We consider interval lengths 0.1σ, 0.2σ, and 0.5σ, where σ = 2 is the approximate standard deviation of ε. For each of the 1,000 replications, we estimate the coverage probability by predicting at 1,000 equally spaced predictor values from 0.1 to 0.9.
From Table 2, we can see that MODLR provides higher coverage probabilities than the other three methods. Therefore, when the errors are skewed, a small prediction band around the modal line can have a larger coverage probability than a band of the same length around the mean or median regression line.
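The coverage comparison behind Table 2 can be reproduced in spirit as follows; this is our own sketch of the calculation (a fixed-width band around a candidate line, evaluated on fresh draws from an illustrative skewed model), not the authors' code:

```python
import numpy as np

def band_coverage(beta, width, x_grid, sample_y, n_draws=2000, seed=0):
    """Average, over a predictor grid, of the fraction of fresh responses
    falling inside a band of total length `width` centered at x^T beta."""
    rng = np.random.default_rng(seed)
    cover = 0.0
    for x in x_grid:
        y = sample_y(x, n_draws, rng)          # fresh draws of Y given X = x
        cover += np.mean(np.abs(y - (beta[0] + beta[1] * x)) <= width / 2)
    return cover / len(x_grid)

# Illustrative skewed model (mixture parameters are our own choice):
# Y = 1 + 3X + (1 + 2X) * eps, mean line 1 + 3x, modal line roughly 2 + 5x.
def sample_y(x, m, rng):
    narrow = rng.uniform(size=m) < 0.5
    eps = np.where(narrow, rng.normal(1.0, 0.5, m), rng.normal(-1.0, 2.5, m))
    return 1 + 3 * x + (1 + 2 * x) * eps

grid = np.linspace(0.1, 0.9, 101)
width = 0.5 * 2.0                              # "0.5 sigma" with sigma near 2
cov_mean = band_coverage(np.array([1.0, 3.0]), width, grid, sample_y)
cov_mode = band_coverage(np.array([2.0, 5.0]), width, grid, sample_y)
print(cov_mean, cov_mode)                      # the modal band covers more
```

Because the band around the modal line sits on the highest-density ridge of the conditional distribution, it captures noticeably more fresh responses than an equal-width band around the mean line.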
In addition, MEDREG provides larger coverage probabilities than the MM-estimate and LSE, and the MM-estimate provides larger coverage probabilities than LSE. Note that when the interval length is large enough, the different methods provide similar coverage probabilities.

Figure 1: Scatter plot of a typical sample with n = 200 for Example 1, with the different estimated regression lines: the mean regression line based on LSE, the median regression line, the regression line based on the MM-estimate, and the modal regression line.
Table 1: Average (Std) of parameter estimates over 1,000 repetitions.

Method        Parameter    n=50       n=100          n=200          n=400
LSE           β0 = 1       (0.964)    0.989(0.659)   1.007(0.490)   1.009(0.322)
              β1 = 3       (2.260)    3.063(1.500)   2.977(1.160)   2.976(0.733)
MEDREG        β0 = 1.67    (0.707)    1.613(0.422)   1.636(0.301)   1.667(0.188)
              β1 = 4.34    (1.670)    4.372(0.981)   4.339(0.705)   4.312(0.457)
MM-estimate   β0           (0.782)    1.040(0.530)   1.022(0.376)   1.035(0.265)
              β1           (1.640)    5.234(1.060)   5.271(0.744)   5.271(0.512)
MODLR         β0 = 2       (0.670)    1.841(0.372)   1.875(0.229)   1.912(0.140)
              β1 = 5       (1.750)    5.024(0.948)   5.044(0.574)   5.020(0.387)

Table 2: Average (Std) of coverage probabilities over 1,000 repetitions with σ = 2.

Width   Method        n=50           n=100          n=200          n=400
0.1σ    LSE           0.034(0.015)   0.032(0.011)   0.030(0.009)   0.029(0.007)
        MEDREG        0.073(0.018)   0.077(0.014)   0.078(0.012)   0.080(0.010)
        MM-estimate   0.065(0.023)   0.067(0.019)   0.066(0.015)   0.067(0.012)
        MODLR         0.087(0.016)   0.092(0.012)   0.095(0.010)   0.095(0.009)
0.2σ    LSE           0.069(0.028)   0.065(0.022)   0.061(0.015)   0.059(0.013)
        MEDREG        0.144(0.033)   0.153(0.024)   0.155(0.019)   0.158(0.015)
        MM-estimate   0.129(0.042)   0.133(0.034)   0.132(0.027)   0.134(0.021)
        MODLR         0.170(0.027)   0.179(0.018)   0.184(0.013)   0.186(0.012)
0.5σ    LSE           0.186(0.062)   0.181(0.047)   0.174(0.035)   0.171(0.028)
        MEDREG        0.338(0.061)   0.355(0.040)   0.360(0.029)   0.365(0.022)
        MM-estimate   0.313(0.080)   0.322(0.062)   0.325(0.046)   0.330(0.036)
        MODLR         0.378(0.049)   0.395(0.029)   0.404(0.018)   0.407(0.015)

3.2. The forest fires data

The occurrence of forest fires, also called wildfires, is a major concern for forest preservation and causes ecological and economic damage. Fast detection is vital for successful firefighting. Since traditional human surveillance and automatic solutions based on satellites or infrared/smoke scanners are expensive, the use of low-cost meteorological data to warn the public of fire danger and to support fire management decisions has received much attention. In this section, we illustrate the proposed methodology with an analysis of forest fire data (Cortez and Morais, 2007). The data are available at: pcortez/forestfires/.

This forest fire data set contains 517 observations collected from January 2000 to December 2003 in the Montesinho natural park of the Trás-os-Montes northeast region of Portugal. Every time a forest fire occurred, many features were recorded, such as time, date, spatial location, and weather conditions. Following Cortez and Morais (2007), we use the four meteorological variables outside temperature (temp), outside relative humidity (RH), outside wind speed (wind), and outside rain (rain) to predict the total burned area (area). We fit the data by LSE, MEDREG, the MM-estimate, and MODLR. One important feature of this data set is that it contains outliers and a positively skewed response variable (area). Therefore, it is expected that the traditional mean regression based on the LSE will not work well.

To see why MODLR has better prediction properties than the other methods, we compare the widths of the prediction intervals with the same confidence level. Suppose we have obtained the parameter estimate β̂ and the corresponding residuals ε̂_1, ..., ε̂_n from fitting the data, and let ε̂_[i] denote the ith smallest value of the residuals. The traditional method of constructing a prediction interval for a new predictor x_new is

(x_new^T β̂ + ε̂_[n_1], x_new^T β̂ + ε̂_[n_2]),

where n_1 = ⌊nα/2⌋ and n_2 = n − n_1. This symmetric method is ideal if the regression error is symmetric. Since MODLR focuses on the highest conditional density region, we propose a method for MODLR that uses the information in the skewed error density to construct prediction intervals with (1 − α)100% confidence level. Suppose f̂(·) is the kernel density estimate of ε based on the residuals ε̂_1, ..., ε̂_n estimated by MODLR. We propose to find indexes k_1 < k_2 such that k_2 − k_1 = n_2 − n_1 ≈ n(1 − α) and f̂(ε̂_[k_1]) ≈ f̂(ε̂_[k_2]).
Then the proposed prediction interval for a new predictor x_new is (x_new^T β̂ + ε̂_[k_1], x_new^T β̂ + ε̂_[k_2]).
Table 3: Average widths (percentage of coverage) of the prediction intervals.

                          Nominal confidence levels
Method        10%            30%            50%            90%
LSE           2.166(0.101)   6.687(0.294)   12.70(0.493)   53.03(0.896)
MEDREG        0.975(0.091)   2.638(0.292)   6.506(0.491)   48.52(0.894)
MM-estimate   1.144(0.099)   2.910(0.294)   6.499(0.497)   48.49(0.906)
MODLR         0.012(0.112)   0.035(0.311)   0.571(0.499)   26.44(0.899)

We propose the following iterative algorithm to find the indexes k_1 and k_2. Let k_1 = n_1 and k_2 = n_2 be the initial values.

Step 1: If f̂(ε̂_[k_1]) < f̂(ε̂_[k_2]) and f̂(ε̂_[k_1+1]) < f̂(ε̂_[k_2+1]), set k_1 = k_1 + 1 and k_2 = k_2 + 1; if f̂(ε̂_[k_1]) > f̂(ε̂_[k_2]) and f̂(ε̂_[k_1−1]) > f̂(ε̂_[k_2−1]), set k_1 = k_1 − 1 and k_2 = k_2 − 1.

Step 2: Iterate the above procedure until neither of the two conditions is satisfied or (k_1 − 1)(k_2 − n) = 0.

In Table 3, we report the average widths and the actual coverage rates of the prediction intervals for the 10%, 30%, 50%, and 90% confidence levels. The actual coverage rates are estimated by leave-one-out cross validation. From Table 3, we have the following findings:

1. All the prediction intervals are well calibrated: the actual coverage rates are very close to the nominal confidence levels.

2. The average widths of the prediction intervals constructed around MODLR are much shorter than those of the prediction intervals constructed around the other three estimates.

3. Both MEDREG and the MM-estimate have shorter prediction intervals than LSE.
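The index search in Steps 1-2 slides a fixed-length window over the sorted residuals toward the high-density side. A sketch of this search follows; the kernel density estimate, residual sample, and bandwidth are our own illustrative choices:

```python
import math

def shift_hdr_window(fhat, sorted_resid, n1, n2):
    """Slide the window [k1, k2] over the sorted residuals per Steps 1-2:
    move right while the density at the right end dominates, move left in
    the opposite case, and stop when neither holds or a boundary is hit."""
    n = len(sorted_resid)
    k1, k2 = n1, n2
    while 0 < k1 and k2 < n - 1:   # 0-based indexes; boundary stopping rule
        f1, f2 = fhat(sorted_resid[k1]), fhat(sorted_resid[k2])
        if f1 < f2 and fhat(sorted_resid[k1 + 1]) < fhat(sorted_resid[k2 + 1]):
            k1, k2 = k1 + 1, k2 + 1
        elif f1 > f2 and fhat(sorted_resid[k1 - 1]) > fhat(sorted_resid[k2 - 1]):
            k1, k2 = k1 - 1, k2 - 1
        else:
            break
    return k1, k2

# Illustrative right-skewed residuals (dense near the left end).
resid = sorted(((i / 100.0) ** 2) * 10 - 2 for i in range(101))
h = 0.8
def fhat(t):
    # Normal-kernel density estimate of the residuals.
    return sum(math.exp(-0.5 * ((t - r) / h) ** 2) for r in resid) / (
        len(resid) * h * math.sqrt(2.0 * math.pi))

alpha = 0.2
n1 = int(len(resid) * alpha / 2)
n2 = len(resid) - 1 - n1
k1, k2 = shift_hdr_window(fhat, resid, n1, n2)
print(resid[k1], resid[k2])   # endpoints of the shifted, skewed interval
```

Both branches move the endpoints together, so the window length, and hence the nominal coverage, is preserved while the interval migrates into the highest-density region of the residual distribution.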
4. Summary and Discussion

In this article, we have proposed a powerful new data analysis tool, called modal regression, for exploring the relationship between a response variable and a set of covariates. Modal regression investigates this relationship through the most likely conditional values instead of the conditional mean used by traditional regression techniques. When the error is skewed, modal regression provides a more meaningful prediction than the LSE. Our simulation results demonstrate that modal regression attains a larger coverage probability for prediction than some other commonly used regression methods when short intervals of the same length, centered around each estimate, are used.

Further research could address how to construct the shortest (skewed) prediction interval for a given confidence level using the information in the skewed error density. One related work is Kim and Lindsay (2011), who proposed using confidence distribution sampling to visualize confidence sets. In the forest fire data application, we provided one possible way to construct a skewed prediction interval for MODLR. Based on the cross-validation results, the proposed skewed prediction intervals for MODLR were much shorter than the prediction intervals constructed by some other commonly used regression methods for the forest fire data.

Modal linear regression assumes that the mode of the conditional density of Y given x is a linear function of x. The idea of modal linear regression can easily be generalized to other models, such as nonlinear regression, nonparametric regression, and varying coefficient partially linear regression. In addition, it will also be interesting to study how to select informative variables based on the modal regression idea. These are directions for our future research.

Appendix

The following technical conditions are imposed in this section.
(A1) g^(v)(t|x), v = 0, 1, 2, 3, is continuous in a neighborhood of 0, and g′(0|x) = 0 for any x.

(A2) n^{−1} Σ_{i=1}^n g″(0|x_i) x_i x_i^T = J + o_p(1), n^{−1} Σ_{i=1}^n g‴(0|x_i) x_i = K + o_p(1), and n^{−1} Σ_{i=1}^n g(0|x_i) x_i x_i^T = L + o_p(1), where J < 0, i.e., J is a negative definite matrix.

(A3) n^{−1} Σ_{i=1}^n ‖x_i‖^4 = O_p(1).

The conditions above are mild and are fulfilled in many applications.

Proof of Theorem 2.1: Note that

log{Q_h(β^(k+1))} − log{Q_h(β^(k))}
  = log{Σ_{i=1}^n φ_h(y_i − x_i^T β^(k+1))} − log{Σ_{i=1}^n φ_h(y_i − x_i^T β^(k))}
  = log[ Σ_{i=1}^n φ_h(y_i − x_i^T β^(k+1)) / Σ_{j=1}^n φ_h(y_j − x_j^T β^(k)) ]
  = log[ Σ_{i=1}^n π(i|β^(k)) · φ_h(y_i − x_i^T β^(k+1)) / φ_h(y_i − x_i^T β^(k)) ],

where π(i|β^(k)) = φ_h(y_i − x_i^T β^(k)) / Σ_{j=1}^n φ_h(y_j − x_j^T β^(k)). Based on Jensen's inequality applied to the concave log function, we have

log{Q_h(β^(k+1))} − log{Q_h(β^(k))} ≥ Σ_{i=1}^n π(i|β^(k)) log{ φ_h(y_i − x_i^T β^(k+1)) / φ_h(y_i − x_i^T β^(k)) }.

Based on the property of the M step in (2.6), the right-hand side is nonnegative; hence log{Q_h(β^(k+1))} − log{Q_h(β^(k))} ≥ 0 and thus Q_h(β^(k+1)) ≥ Q_h(β^(k)).
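The ascent property just proved is what makes the modal EM iteration practical. The following is a minimal sketch of the iteration (our own illustration, assuming the Gaussian kernel φ_h(t) = h^{−1}φ(t/h), for which the M step maximizing Σ_i π(i|β^(k)) log φ_h(y_i − x_i^T β) reduces to a weighted least-squares update):

```python
import numpy as np

def modal_em(X, y, h, beta0, n_iter=200, tol=1e-8):
    """Modal EM for modal linear regression with a Gaussian kernel.

    E-step: pi_i proportional to phi_h(y_i - x_i' beta), the kernel weight
            of each observation under the current fit.
    M-step: weighted least squares with weights pi_i, which maximizes
            sum_i pi_i * log phi_h(y_i - x_i' beta) for a Gaussian phi.
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / h) ** 2)       # E-step: unnormalized pi(i | beta)
        w /= w.sum()                          # normalize (does not change the M-step)
        WX = w[:, None] * X                   # diag(pi) @ X
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)  # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

By Theorem 2.1, Q_h is nondecreasing along these iterates, so starting the algorithm from, say, the least squares estimate can only improve the kernel objective.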
Proof of Theorem 2.2: Note that φ″_h(t) = h^{−3}(t²/h² − 1)φ(t/h) and φ′_h(t) = −t h^{−3} φ(t/h). Let a_n = (nh³)^{−1/2} + h². It is sufficient to show that for any given η > 0, there exists a large constant c such that

P{ sup_{‖μ‖=c} Q_h(β_0 + a_n μ) < Q_h(β_0) } ≥ 1 − η.   (A.1)

Let X = (x_1, ..., x_n)^T and y = (y_1, ..., y_n)^T. Denote

K_n ≡ ∂Q_h(β_0)/∂β = n^{−1} Σ_{i=1}^n φ′_h(y_i − x_i^T β_0) x_i,   (A.2)

J_n ≡ ∂²Q_h(β_0)/∂β∂β^T = n^{−1} Σ_{i=1}^n φ″_h(y_i − x_i^T β_0) x_i x_i^T,   (A.3)

where Q_h(β) is defined in (2.3) and β_0 is the true parameter value. By directly calculating the mean and variance, we can get

E(J_n | x) = n^{−1} Σ_{i=1}^n g″(0|x_i) x_i x_i^T {1 + o_p(1)} = J{1 + o_p(1)},   Var(J_n | x) = O_p{(nh⁵)^{−1}},
E(K_n | x) = (h²/(2n)) Σ_{i=1}^n g‴(0|x_i) x_i {1 + o_p(1)} = (h²/2) K{1 + o_p(1)},
Cov(K_n | x) = (ν₂/(n²h³)) Σ_{i=1}^n g(0|x_i) x_i x_i^T {1 + o(1)} = (ν₂/(nh³)) L{1 + o_p(1)},   (A.4)

where J = lim n^{−1} Σ_{i=1}^n g″(0|x_i) x_i x_i^T, K = lim n^{−1} Σ_{i=1}^n g‴(0|x_i) x_i, and L = lim n^{−1} Σ_{i=1}^n g(0|x_i) x_i x_i^T. By default, when calculating the variance of a matrix, we take the variance of each element of the matrix. Using the result X = E(X) + O_p({Var(X)}^{1/2}), since nh⁵ → ∞,
J_n = J + o_p(1). Notice that

Q_h(β_0 + a_n μ) − Q_h(β_0) = a_n K_n^T μ + (a_n²/2) μ^T J_n μ − (a_n³/(6nh⁴)) Σ_{i=1}^n φ‴((y_i − x_i^T β*)/h)(x_i^T μ)³ ≡ M₁ + M₂ + M₃,   (A.5)

where ‖μ‖ = c and ‖β* − β_0‖ ≤ c a_n. From (A.4), we get K_n = O_p(a_n) and hence M₁ = O_p(a_n²). Note that M₂ = 0.5 a_n² μ^T J μ {1 + o_p(1)}. Based on the boundedness of φ⁽⁴⁾(t) and ‖β* − β_0‖ ≤ c a_n, we have

φ‴((y_i − x_i^T β*)/h) = φ‴((y_i − x_i^T β_0)/h){1 + o_p(1)}.

Noting that φ‴(t) = (3t − t³)φ(t), based on the Taylor expansion and the symmetry of φ(t), we have

E{φ‴((Y_i − x_i^T β_0)/h)} = O_p(h⁴),   Var{φ‴((Y_i − x_i^T β_0)/h)} = O_p(h).   (A.6)

Since nh⁵ → ∞, we can prove that M₃ = o_p(a_n²). For any η > 0, we can choose c big enough such that the second term M₂ dominates the other two terms in (A.5) with probability at least 1 − η. Since J < 0 (negative definite), Q_h(β_0 + a_n μ) − Q_h(β_0) < 0 with probability at least 1 − η. The result of Theorem 2.2 follows.

Proof of Theorem 2.3: Suppose β̂ is the consistent solution to ∂Q_h(β)/∂β = 0 found in Theorem 2.2. Based on the Taylor expansion, we have

0 = ∂Q_h(β̂)/∂β = K_n + (J_n + L_n)(β̂ − β_0),   (A.7)
where

L_n = (1/(2nh⁴)) Σ_{i=1}^n φ‴((Y_i − β*^T x_i)/h) {(β̂ − β_0)^T x_i} x_i x_i^T,

with ‖β* − β_0‖ ≤ ‖β̂ − β_0‖. Based on the result of (A.7), we have β̂ − β_0 = −(J_n + L_n)^{−1} K_n. Since ‖β̂ − β_0‖ = O_p(a_n), where a_n = (nh³)^{−1/2} + h², similar to the proof for M₃ in (A.5), we have L_n = o_p(1). Hence, based on (A.4), we have β̂ − β_0 = −J^{−1} K_n {1 + o_p(1)}.

Next we prove the asymptotic normality of K*_n = (nh³)^{1/2} K_n. For any unit vector d ∈ R^{p+1}, we prove

{d^T Cov(K*_n) d}^{−1/2} {d^T K*_n − d^T E(K*_n)} →_D N(0, 1).

Let ξ_i = (nh)^{−1/2} φ′((Y_i − x_i^T β_0)/h) d^T x_i. Then d^T K*_n = Σ_{i=1}^n ξ_i. We check Lyapunov's condition. Based on the results in (A.4), we know

Cov(K_n) = (ν₂/(nh³)) L {1 + o(1)}.   (A.8)

Hence Var(d^T K*_n) = nh³ d^T Cov(K_n) d = ν₂ d^T L d + o(1). So we only need to prove nE|ξ₁|³ → 0. Noticing that |d^T x_i|² ≤ ‖x_i‖² ‖d‖² = ‖x_i‖², and that φ′(·) is bounded, we have nE|ξ₁|³ = O{(nh³)^{−1/2}} → 0. So the asymptotic normality of K*_n holds, i.e., we have

(nh³)^{1/2} { K_n − (h²/2) K {1 + o_p(h²)} } →_D N(0, ν₂ L).

Based on Slutsky's theorem, we have

(nh³)^{1/2} [ β̂ − β_0 + (h²/2) J^{−1} K {1 + o_p(h²)} ] →_D N{0, ν₂ J^{−1} L J^{−1}}.
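The variance formula (A.4)/(A.8) can be checked numerically. The following small Monte Carlo sketch is our own sanity check (not from the paper), specializing to the intercept-only model x ≡ 1 with standard normal errors, so that L = g(0) = 1/√(2π); we assume the kernel constant is ν₂ = ∫ t² φ(t)² dt = 1/(4√π) for the Gaussian kernel:

```python
import numpy as np

# Monte Carlo check of Cov(K_n | x) = nu2 * L / (n h^3) {1 + o(1)} in (A.4),
# for the location model x = 1 with errors from g = N(0, 1) (mode at 0).
def phi_h_prime(t, h):
    """phi_h'(t) = -t h^{-3} phi(t/h), the derivative of the rescaled kernel."""
    return -t / h**3 * np.exp(-0.5 * (t / h) ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
n, h, reps = 2000, 0.3, 2000
eps = rng.standard_normal((reps, n))     # reps independent samples of the errors
Kn = phi_h_prime(eps, h).mean(axis=1)    # one realization of K_n per replicate

nu2 = 1.0 / (4.0 * np.sqrt(np.pi))       # assumed kernel constant: int t^2 phi(t)^2 dt
g0 = 1.0 / np.sqrt(2.0 * np.pi)          # g(0) for the standard normal error density
predicted_var = nu2 * g0 / (n * h**3)    # leading term of Var(K_n) from (A.4)
```

For a symmetric error density g‴(0) = 0, so the mean of K_n is negligible here, and the empirical variance of K_n across replicates matches the predicted ν₂ g(0)/(nh³) up to the O(h²) correction.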
Proof of Theorem 2.4: Since β̂_s is root-n consistent, the asymptotic result for β̂_0 is the same as if β̂_s were known, and its asymptotic distribution can be derived from Theorem 2.3 by taking x = 1 and using the independence of ε and x, under which we have J^{−1}K = g‴(0)/g″(0) and J^{−1}LJ^{−1} = g(0)/{g″(0)}². Then the result follows.

Proof of Theorem 2.5: Let φ*(t) = (2π)^{1/2} h φ_h(t). Then M* = Σ_{i=1}^n φ*(y_i − x_i^T β̂), where β̂ = T(Z). Notice that φ*(·) attains its maximum at 0 with φ*(0) = 1, decreases monotonically toward both sides, and φ*(t) → 0 as |t| → ∞.

We first prove that T(Z ∪ Z*) stays bounded if m < M*. Let ξ > 0 be such that m + nξ < M*, and let C be such that φ*(t) ≤ ξ whenever |t| ≥ C. Let β be any vector such that |y − x^T β| ≥ C for all z = (x, y) in Z. Then

Σ_{i=1}^{m+n} φ*(y_i − x_i^T T(Z ∪ Z*)) ≥ M*   (A.9)

and

Σ_{i=1}^{m+n} φ*(y_i − x_i^T β) ≤ nξ + m.   (A.10)

From (A.9) and (A.10), one knows that T(Z ∪ Z*) must satisfy |y − x^T T(Z ∪ Z*)| < C for at least one point in Z, and thus T(Z ∪ Z*) is bounded.

On the other hand, if m > M*, let ξ > 0 be such that m − mξ > M*, and let C be such that φ*(t) ≤ ξ whenever |t| ≥ C. Assume that all the points {(x_{n+1}, y_{n+1}), ..., (x_{n+m}, y_{n+m})} in Z* are identical and satisfy a linear relationship y = x^T β*. Let β be any vector such that |y_{n+1} − x_{n+1}^T β| ≥ C. Then

Σ_{i=1}^{m+n} φ*(y_i − x_i^T β) ≤ M* + mξ,   (A.11)
and

Σ_{i=1}^{m+n} φ*(y_i − x_i^T β*) ≥ m.   (A.12)

From (A.11) and (A.12), one knows that T(Z ∪ Z*) must satisfy |y_{n+1} − x_{n+1}^T T(Z ∪ Z*)| < C. If we let y_{n+1} → ∞ with x_{n+1} fixed, T(Z ∪ Z*) must go off to infinity, and we have breakdown.

References

Chaudhuri, P. and Marron, J. S. (1999). SiZer for Exploration of Structures in Curves. Journal of the American Statistical Association, 94.

Cortez, P. and Morais, A. (2007). A Data Mining Approach to Predict Forest Fires using Meteorological Data. Proceedings of the 13th EPIA Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal.

Davies, P. L. and Kovac, A. (2004). Densities, Spectral Densities and Modality. Annals of Statistics, 32.

Donoho, D. L. (1982). Breakdown Properties of Multivariate Location Estimators. Ph.D. qualifying paper, Department of Statistics, Harvard University.

Donoho, D. L. and Huber, P. J. (1983). The Notion of Breakdown Point. In: Doksum, K., Hodges, J. L. (Eds.), A Festschrift for Erich L. Lehmann. Wadsworth, Belmont, CA.

Eddy, W. F. (1980). Optimum Kernel Estimators of the Mode. Annals of Statistics, 8.

Fisher, N. I. and Marron, J. S. (2001). Mode Testing via the Excess Mass Estimate. Biometrika, 88.
Friedman, J. H. and Fisher, N. I. (1999). Bump Hunting in High-Dimensional Data. Statistics and Computing, 9.

Hall, P., Minnotte, M. C., and Zhang, C. (2004). Bump Hunting with Non-Gaussian Kernels. Annals of Statistics, 32.

Hampel, F. R. (1971). A General Qualitative Definition of Robustness. Annals of Mathematical Statistics, 42.

Hampel, F. R. (1974). The Influence Curve and Its Role in Robust Estimation. Journal of the American Statistical Association, 69.

Huber, P. J. (1984). Finite Sample Breakdown Points of M- and P-Estimators. Annals of Statistics, 12.

Kemp, G. C. R. and Santos Silva, J. M. C. (2010). Regression towards the Mode. Department of Economics, University of Essex, Discussion Paper. jmcss/research.html.

Kim, D. and Lindsay, B. G. (2011). Using Confidence Distribution Sampling to Visualize Confidence Sets. Statistica Sinica, 21(2).

Lee, M. J. (1989). Mode Regression. Journal of Econometrics, 42.

Li, J., Ray, S., and Lindsay, B. G. (2007). A Nonparametric Statistical Approach to Clustering via Mode Identification. Journal of Machine Learning Research, 8(8).

Muller, D. W. and Sawitzki, G. (1991). Excess Mass Estimates and Tests for Multimodality. Journal of the American Statistical Association, 86.

Parzen, E. (1962). On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33.
Ray, S. and Lindsay, B. G. (2005). The Topography of Multivariate Normal Mixtures. Annals of Statistics, 33.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. New York: Wiley.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.

Yao, W. and Lindsay, B. G. (2009). Bayesian Mixture Labeling by Highest Posterior Density. Journal of the American Statistical Association, 104.

Yohai, V. J. (1987). High Breakdown-Point and High Efficiency Robust Estimates for Regression. The Annals of Statistics, 15.
More informationIndian Statistical Institute
Indian Statistical Institute Introductory Computer programming Robust Regression methods with high breakdown point Author: Roll No: MD1701 February 24, 2018 Contents 1 Introduction 2 2 Criteria for evaluating
More informationSmall Sample Corrections for LTS and MCD
myjournal manuscript No. (will be inserted by the editor) Small Sample Corrections for LTS and MCD G. Pison, S. Van Aelst, and G. Willems Department of Mathematics and Computer Science, Universitaire Instelling
More informationBTRY 4090: Spring 2009 Theory of Statistics
BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)
More informationLabor-Supply Shifts and Economic Fluctuations. Technical Appendix
Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January
More informationA nonparametric method of multi-step ahead forecasting in diffusion processes
A nonparametric method of multi-step ahead forecasting in diffusion processes Mariko Yamamura a, Isao Shoji b a School of Pharmacy, Kitasato University, Minato-ku, Tokyo, 108-8641, Japan. b Graduate School
More informationAFT Models and Empirical Likelihood
AFT Models and Empirical Likelihood Mai Zhou Department of Statistics, University of Kentucky Collaborators: Gang Li (UCLA); A. Bathke; M. Kim (Kentucky) Accelerated Failure Time (AFT) models: Y = log(t
More informationIssues on quantile autoregression
Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides
More informationRegression Graphics. 1 Introduction. 2 The Central Subspace. R. D. Cook Department of Applied Statistics University of Minnesota St.
Regression Graphics R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Abstract This article, which is based on an Interface tutorial, presents an overview of regression
More informationLecture 5: Clustering, Linear Regression
Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 Hierarchical clustering Most algorithms for hierarchical clustering
More informationESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS
Statistica Sinica 13(2003), 1201-1210 ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS Liqun Wang University of Manitoba Abstract: This paper studies a minimum distance moment estimator for
More informationEmpirical Likelihood Methods for Sample Survey Data: An Overview
AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 191 196 Empirical Likelihood Methods for Sample Survey Data: An Overview J. N. K. Rao Carleton University, Ottawa, Canada Abstract: The use
More information