NONPARAMETRIC MIXTURE OF REGRESSION MODELS
Huang, M., & Li, R.

The Pennsylvania State University Technical Report Series #09-93
College of Health and Human Development
The Pennsylvania State University
Mian Huang
Department of Statistics, Shanghai University of Finance and Economics, Shanghai, P. R. China

Runze Li
Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA, USA

Abstract: Motivated by an analysis of US house price index data, we propose nonparametric finite mixture of regression models. We develop an estimation procedure for the newly proposed models using the EM algorithm and kernel regression. The asymptotic properties of the proposed estimate are systematically studied: we derive its asymptotic bias and variance and establish its asymptotic normality. Monte Carlo simulations are conducted to examine the finite sample performance of the proposed estimation procedure, and the methodology is illustrated with an empirical analysis of the US house price index data.

Key words and phrases: EM algorithm, kernel regression, mixture of regression models, nonparametric regression.

1. Introduction

Mixture models have been widely used in econometrics and social science, and the theory of mixture models has been well studied (Lindsay, 1995). Finite mixtures of linear regression models, a useful class of mixture models, have been applied in various fields since their introduction in Goldfeld and Quandt (1976): in econometrics and marketing (DeSarbo, 1993; Frühwirth-Schnatter, 2001; Rossi et al., 2005), in epidemiology (Green and Richardson, 2002), and in biology (Wang et al., 1996). Bayesian approaches for mixture of regression models are summarized in Frühwirth-Schnatter (2005). Many efforts have been devoted to these models
and their extensions, such as finite mixtures of generalized linear models; these are comprehensively summarized in McLachlan and Peel (2000).

Motivated by an analysis of the US house price index data in section 3.4, we propose nonparametric finite mixtures of regression models. Compared with finite mixtures of linear regression models, the newly proposed models relax the linearity assumption on the regression function and allow the regression function in each component to be an unknown but smooth function of its covariates. In this paper we consider the situation in which the mixing proportions, the mean functions and the variance functions are all nonparametric. Using a kernel regression technique, we develop an estimation procedure for the unknown functions in the nonparametric mixture of regression model via a local likelihood approach. The sampling properties of the proposed estimation procedure are investigated: we derive the asymptotic bias and variance of the local likelihood estimate and establish its asymptotic normality.

For estimation, one may naively implement an EM algorithm (Dempster et al., 1977) to maximize the local likelihood function. However, one typically wants to estimate the whole curves by evaluating the resulting estimates over a set of grid points, and a naive implementation of the EM algorithm does not ensure that the component labels match correctly across different grid points. This is similar to the label switching problem in other applications of mixture modeling. We therefore modify the EM algorithm to simultaneously maximize the local likelihood functions for the proposed nonparametric mixture of regression models at the whole set of grid points. The modified EM algorithm works well in our simulations and in a real data example. We further study the ascent property of the proposed EM algorithm, and derive a standard error formula for the resulting estimate from the conventional sandwich formula.
A bandwidth selector is proposed for the local likelihood estimate using a multi-fold cross-validation method. A simulation study is conducted to examine the performance of the proposed procedure and test the accuracy of the proposed standard error formula. We further demonstrate the proposed model and estimation procedure by an empirical analysis of US house price index data. The rest of this paper is structured as follows. In section 2, we present
the nonparametric finite mixture of regression models and derive asymptotic properties of the local likelihood estimate. We further develop an estimation procedure for the proposed model using the EM algorithm and kernel regression. Simulation results and an empirical analysis of a real data set are presented in section 3. Some discussion is provided in section 4. Technical conditions and proofs are given in section 5.

2. Estimation Procedure and its Sampling Properties

2.1. Model Definition

Suppose that {(x_i, y_i), i = 1, …, n} is a random sample from the population (X, Y). Throughout this paper we assume X is univariate. The proposed methodology and theoretical results can be extended to multivariate X, but the extension is less useful due to the curse of dimensionality. Let C be a latent class variable, and assume that conditional on X = x, C has a discrete distribution P(C = c | X = x) = π_c(x) for c = 1, 2, …, C. Conditional on C = c and X = x, Y follows a normal distribution with mean m_c(x) and variance σ²_c(x), where m_c(·) and σ²_c(·) are unknown but smooth functions. In other words, conditional on X = x, the response Y follows a finite mixture of normals

    Σ_{c=1}^C π_c(x) N{m_c(x), σ²_c(x)}.  (2.1)

In this paper we assume that the number of components C is fixed, and we refer to model (2.1) as a nonparametric finite mixture of regression models because π_c(·), m_c(·) and σ²_c(·) are nonparametric. We discuss how to determine C in section 3. When π_c(x) and σ²_c(x) are constant and m_c(x) is linear in x, model (2.1) reduces to a finite mixture of linear regression models (Goldfeld and Quandt, 1976). When C = 1, model (2.1) is a nonparametric regression model; see Chapter 2 of Fan and Gijbels (1996) for a review. Thus model (2.1) can be regarded as a natural extension of both nonparametric regression models and finite mixtures of linear regression models.
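To make the data-generating mechanism of model (2.1) concrete, here is a minimal sketch of drawing a sample from a 2-component version. The particular choices of π_1, the mean functions and the standard deviation functions below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_mixture(n, pi1, m, s, rng):
    """Draw (x_i, y_i, c_i) from model (2.1) with C = 2 components:
    given X = x, the label c is 1 with probability pi1(x), else 2, and
    Y | (c, x) ~ N{m[c-1](x), s[c-1](x)^2}."""
    x = rng.uniform(0.0, 1.0, size=n)
    c = np.where(rng.uniform(size=n) < pi1(x), 1, 2)   # latent class labels
    mean = np.where(c == 1, m[0](x), m[1](x))
    sd = np.where(c == 1, s[0](x), s[1](x))
    y = rng.normal(mean, sd)
    return x, y, c

# Illustrative component functions (hypothetical, for demonstration only):
rng = np.random.default_rng(0)
pi1 = lambda x: 0.5 + 0.3 * x                              # pi_1(x)
m = [lambda x: np.sin(2 * np.pi * x), lambda x: 2.0 + x]   # m_1(x), m_2(x)
s = [lambda x: 0.3 + 0.1 * x, lambda x: 0.2 + 0.2 * x]     # sigma_1(x), sigma_2(x)
x, y, c = sample_mixture(500, pi1, m, s, rng)
```

All three ingredients of (2.1) vary smoothly with x here, which is exactly the structure the nonparametric estimation procedure below is designed to recover.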
It is easy to adapt the conventional constraints imposed on finite mixtures of linear regression models to the proposed model so that it is identifiable and the corresponding likelihood function is bounded. For further references on these issues, see Hennig (2000) and Hathaway (1985).

Since π_c(·), m_c(·) and σ²_c(·) are nonparametric, we need nonparametric smoothing techniques for (2.1). In this paper we employ kernel regression. Suppose we want to estimate the unknown functions at x_0. In kernel regression, we first use local constants π_{c0}, m_{c0} and σ²_{c0} to approximate π_c(x_0), m_c(x_0) and σ²_c(x_0). Let K(·) be a kernel function and K_h(·) = h⁻¹K(·/h) its rescaling with bandwidth h. Furthermore, denote by φ(y | μ, σ²) the density function of N(μ, σ²). Then the corresponding local log-likelihood function for the data {(x_i, y_i), i = 1, 2, …, n} is

    ℓ_n(π_0, σ²_0, m_0) = Σ_{i=1}^n log[ Σ_{c=1}^C π_{c0} φ(y_i | m_{c0}, σ²_{c0}) ] K_h(x_i − x_0),  (2.2)

where m_0 = (m_{10}, …, m_{C0})^T, σ²_0 = (σ²_{10}, …, σ²_{C0})^T, π_0 = (π_{10}, …, π_{C−1,0})^T, and π_{C0} = 1 − Σ_{c=1}^{C−1} π_{c0}.

One may also apply local linear or local polynomial regression techniques to estimate π_c(x), m_c(x) and σ²_c(x). Local linear regression has several nice statistical properties (Fan and Gijbels, 1996). However, in our model setting local linear regression does not yield closed-form solutions for the variance functions and mixing proportion functions in the M-step of the proposed EM algorithm; see (2.7), (2.8), and (2.9) for details. Thus we use kernel regression to estimate the nonparametric functions. Let {π̂_0, σ̂²_0, m̂_0} be the maximizer of the local likelihood function (2.2). Then the estimates of π_c(x_0), σ²_c(x_0) and m_c(x_0) are π̂_c(x_0) = π̂_{c0}, σ̂²_c(x_0) = σ̂²_{c0}, and m̂_c(x_0) = m̂_{c0}.

2.2. Asymptotic Properties

We next study the asymptotic properties of π̂_c(x_0), σ̂²_c(x_0) and m̂_c(x_0). Let θ = (π^T, σ^{2T}, m^T)^T, and denote

    η(y | θ) = Σ_{c=1}^C π_c φ(y | m_c, σ²_c),  and  ℓ(θ, y) = log η(y | θ).

Let θ(x_0) = {π(x_0)^T, σ²(x_0)^T, m(x_0)^T}^T, and define

    η_1{θ(x), y} = ∂η{y | θ(x)}/∂θ,  η_2{θ(x), y} = ∂²η{y | θ(x)}/∂θ∂θ^T,
    q_1{θ(x), y} = ∂ℓ{θ(x), y}/∂θ,  q_2{θ(x), y} = ∂²ℓ{θ(x), y}/∂θ∂θ^T,
    I(x_0) = −E[q_2{θ(x_0), Y} | X = x_0],  and  Λ(x) = ∫ q_1{θ(x_0), y} η{y | θ(x)} dy.
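The local log-likelihood (2.2) can be evaluated directly from its definition. The sketch below assumes the Epanechnikov kernel (the kernel used in section 3) and takes the data and local constants in the usage as arbitrary illustrations, not values from the paper.

```python
import numpy as np

def normal_pdf(y, mu, s2):
    """Density phi(y | mu, s2) of N(mu, s2)."""
    return np.exp(-(y - mu) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def local_loglik(x, y, x0, h, pi0, m0, s20):
    """Local log-likelihood (2.2): kernel-weighted log of the mixture
    density, with local constants pi0, m0, s20 (one entry per component)."""
    w = epanechnikov((x - x0) / h) / h                 # K_h(x_i - x0)
    dens = sum(p * normal_pdf(y, m, s2)
               for p, m, s2 in zip(pi0, m0, s20))      # mixture density at y_i
    return np.sum(np.log(dens) * w)

# Illustrative data and local constants at x0 = 0.15:
x = np.array([0.0, 0.1, 0.2, 0.3])
y = np.array([0.0, 0.1, 0.2, 0.3])
good = local_loglik(x, y, 0.15, 0.5, [0.5, 0.5], [0.0, 0.3], [1.0, 1.0])
bad = local_loglik(x, y, 0.15, 0.5, [0.5, 0.5], [5.0, 6.0], [1.0, 1.0])
```

As expected, local constants close to the data (`good`) yield a higher local log-likelihood than badly misplaced means (`bad`); the EM algorithm of section 2.3 maximizes this criterion in closed form.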
Denote γ_n = (nh)^{−1/2},

    m̂*_c = √(nh){m̂_{c0} − m_c(x_0)},  σ̂²*_c = √(nh){σ̂²_{c0} − σ²_c(x_0)},  π̂*_c = √(nh){π̂_{c0} − π_c(x_0)},
    π̂*_C = √(nh){π̂_{C0} − π_C(x_0)} = −√(nh) Σ_{c=1}^{C−1} {π̂_{c0} − π_c(x_0)}.

Let m̂* = (m̂*_1, …, m̂*_C)^T, σ̂²* = (σ̂²*_1, …, σ̂²*_C)^T, π̂* = (π̂*_1, …, π̂*_{C−1})^T, and θ̂* = {(π̂*)^T, (σ̂²*)^T, (m̂*)^T}^T. The asymptotic bias, variance and normality of the resulting estimate are given in the following theorem, whose proof is given in section 5.

Theorem 1. Suppose that conditions (A)–(F) in section 5 hold. Then

    √(nh){γ_n θ̂* − B(x_0) + o_p(h²)} →_D N{0, f^{−1}(x_0) ν_0 I^{−1}(x_0)},

where f(·) is the marginal density function of X,

    B(x_0) = I^{−1}(x_0){Λ'(x_0) f'(x_0)/f(x_0) + ½ Λ''(x_0)} κ_2 h²,

κ_l = ∫ u^l K(u) du, and ν_l = ∫ u^l K²(u) du.

2.3. An Effective EM Algorithm

For a given x_0, one may easily maximize the local likelihood function (2.2) with an EM algorithm. In practice, however, we typically want to evaluate the unknown functions at a set of grid points over an interval of x, and hence must maximize the local likelihood function (2.2) at each grid point. This poses a challenge: the component labels produced by the EM algorithm may differ across grid points, which is similar to the label switching problem that occurs when one conducts a bootstrap to get confidence intervals for the parameters of a mixture model. Thus, a naive implementation of the EM algorithm may fail to yield smooth estimated curves. An illustration is given in Figure 2.1, in which the solid curves are obtained when the labels at the grid points u_{j−1}, u_j and u_{j+1} match correctly, while the dotted curves are obtained when the labels at u_j do not match those at u_{j−1} and u_{j+1}.
[Figure 2.1 about here: y plotted against the grid points u_{j−1}, u_j, u_{j+1}, with and without label switching.]

Figure 2.1: An illustration of mismatched labels in a naive implementation of the EM algorithm. Solid curves are obtained when the labels at u_{j−1}, u_j and u_{j+1} match correctly. Dotted curves are obtained when the labels at u_j do not match those at u_{j−1} and u_{j+1}.

In this section, we propose an effective EM algorithm to deal with this issue. The key idea is that in the M-step of the EM algorithm we update the estimated curves at all grid points for the labels given in the E-step. Define Bernoulli random variables

    z_{ic} = 1 if (x_i, y_i) is in the c-th group, and z_{ic} = 0 otherwise,

and let z_i = (z_{i1}, …, z_{iC})^T. The complete data are {(x_i, y_i, z_i), i = 1, 2, …, n}, and the complete log-likelihood function is

    Σ_{i=1}^n Σ_{c=1}^C z_{ic} [ log π_c(x_i) + log φ{y_i | m_c(x_i), σ²_c(x_i)} ].

In the l-th iteration of the EM algorithm, we have m_c^{(l)}(·), σ_c^{2(l)}(·), and
π_c^{(l)}(·). In the E-step, the expectation of the latent variable z_{ic} is

    r_{ic}^{(l)} = π_c^{(l)}(x_i) φ{y_i | m_c^{(l)}(x_i), σ_c^{2(l)}(x_i)} / Σ_{c'=1}^C π_{c'}^{(l)}(x_i) φ{y_i | m_{c'}^{(l)}(x_i), σ_{c'}^{2(l)}(x_i)}.  (2.3)

Let {u_1, …, u_N} be the set of grid points at which the unknown functions are evaluated, where N is the number of grid points. In the M-step, we maximize

    Σ_{i=1}^n Σ_{c=1}^C r_{ic}^{(l)} [ log π_{c0}(x_0) + log φ{y_i | m_{c0}(x_0), σ²_{c0}(x_0)} ] K_h(x_i − x_0),  (2.4)

for x_0 = u_j, j = 1, …, N. In practice, if n is not very large, one may directly take the observed {x_1, …, x_n} as the grid points, in which case N = n.

The maximization of (2.4) is equivalent to maximizing, separately for c = 1, …, C,

    Σ_{i=1}^n r_{ic}^{(l)} log π_{c0}(x_0) K_h(x_i − x_0),  (2.5)

and

    Σ_{i=1}^n r_{ic}^{(l)} log φ{y_i | m_{c0}(x_0), σ²_{c0}(x_0)} K_h(x_i − x_0).  (2.6)

The solution of (2.5) is, for x_0 ∈ {u_j, j = 1, …, N},

    π_{c0}^{(l+1)}(x_0) = Σ_{i=1}^n r_{ic}^{(l)} K_h(x_i − x_0) / Σ_{i=1}^n K_h(x_i − x_0).  (2.7)

The closed-form solution of (2.6) is, for x_0 ∈ {u_j, j = 1, …, N},

    m_{c0}^{(l+1)}(x_0) = Σ_{i=1}^n w_{ic}^{(l)} y_i / Σ_{i=1}^n w_{ic}^{(l)},  (2.8)

    σ_{c0}^{2(l+1)}(x_0) = Σ_{i=1}^n w_{ic}^{(l)} {y_i − m_{c0}^{(l+1)}(x_0)}² / Σ_{i=1}^n w_{ic}^{(l)},  (2.9)

where w_{ic}^{(l)} = r_{ic}^{(l)} K_h(x_i − x_0). Furthermore, we update π_c^{(l+1)}(x_i), m_c^{(l+1)}(x_i) and σ_c^{2(l+1)}(x_i), i = 1, …, n, by linearly interpolating π_{c0}^{(l+1)}(u_j), m_{c0}^{(l+1)}(u_j) and σ_{c0}^{2(l+1)}(u_j), j = 1, …, N, respectively. With initial values of π_c(·), m_c(·) and σ²_c(·), the proposed estimation procedure is summarized in the following algorithm.

An EM algorithm:
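One iteration of the modified EM algorithm, that is, the E-step (2.3) followed by the closed-form updates (2.7)-(2.9) at every grid point, can be sketched as follows. The array shapes and the use of the Epanechnikov kernel are implementation assumptions; the shared responsibilities across all grid points are what keep the component labels aligned.

```python
import numpy as np

def normal_pdf(y, mu, s2):
    return np.exp(-(y - mu) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def em_step(x, y, grid, h, pi_l, m_l, s2_l):
    """One iteration of the modified EM algorithm.

    pi_l, m_l, s2_l are (C, n) arrays holding the current values
    pi_c(x_i), m_c(x_i), sigma_c^2(x_i).  Returns updated (C, N)
    arrays at the grid points via (2.3) and (2.7)-(2.9)."""
    # E-step (2.3): one shared set of responsibilities for all grid points.
    num = pi_l * normal_pdf(y, m_l, s2_l)              # (C, n)
    r = num / num.sum(axis=0, keepdims=True)
    C, N = m_l.shape[0], len(grid)
    pi_new = np.empty((C, N)); m_new = np.empty((C, N)); s2_new = np.empty((C, N))
    for j, u in enumerate(grid):
        K = epanechnikov((x - u) / h) / h              # K_h(x_i - u_j); h must be
        w = r * K                                      # large enough that K.sum() > 0
        pi_new[:, j] = w.sum(axis=1) / K.sum()                             # (2.7)
        m_new[:, j] = (w * y).sum(axis=1) / w.sum(axis=1)                  # (2.8)
        s2_new[:, j] = (w * (y - m_new[:, [j]]) ** 2).sum(axis=1) / w.sum(axis=1)  # (2.9)
    return pi_new, m_new, s2_new

# Illustrative call with two well-separated flat components:
x = np.linspace(0.0, 1.0, 20)
y = np.tile([0.0, 2.0], 10)
pi_l = np.full((2, 20), 0.5)
m_l = np.vstack([np.zeros(20), np.full(20, 2.0)])
s2_l = np.full((2, 20), 0.25)
grid = np.array([0.3, 0.5, 0.7])
pi_new, m_new, s2_new = em_step(x, y, grid, 0.6, pi_l, m_l, s2_l)
```

By construction the updated proportions at each grid point sum to one, and with well-separated components the updated means stay close to the component centers.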
Initial Value: Obtain initial values for π_c(·), m_c(·) and σ²_c(·) by fitting a mixture of polynomial regressions. Denote the initial values by π_c^{(1)}(·), m_c^{(1)}(·) and σ_c^{2(1)}(·). Set l = 1.

E-step: Use (2.3) to calculate r_{ic}^{(l)} for i = 1, …, n and c = 1, …, C.

M-step: For c = 1, …, C and j = 1, …, N, evaluate π_c^{(l+1)}(u_j) by (2.7), m_c^{(l+1)}(u_j) by (2.8) and σ_c^{2(l+1)}(u_j) by (2.9). Further obtain π_c^{(l+1)}(x_i), m_c^{(l+1)}(x_i) and σ_c^{2(l+1)}(x_i) by linear interpolation.

Iterate the E-step and the M-step for l = 1, 2, … until the algorithm converges.

It is well known that an ordinary EM algorithm for parametric models possesses an ascent property, which is a desirable property. The modified EM algorithm can be regarded as an extension of the EM algorithm from parametric models to nonparametric models, so it is of interest to study whether the modified EM algorithm still preserves the ascent property. Let θ^{(l)}(·) = {π^{(l)}(·), σ^{2(l)}(·), m^{(l)}(·)} be the functions estimated at the l-th step of the proposed EM algorithm. Denote θ_0 = (π_0^T, σ_0^{2T}, m_0^T)^T. We rewrite the local log-likelihood function (2.2) as

    ℓ_n(θ_0) = Σ_{i=1}^n ℓ(θ_0, y_i) K_h(x_i − x_0).  (2.10)

Theorem 2. For any given point x_0 in the interval of x, suppose that θ^{(l)}(·) has a continuous first derivative, and that nh → ∞ and h → 0 as n → ∞. Then

    lim inf_{n→∞} n^{−1} [ ℓ_n{θ^{(l+1)}(x_0)} − ℓ_n{θ^{(l)}(x_0)} ] ≥ 0  (2.11)

in probability.

The proof of Theorem 2 is given in section 5. Theorem 2 implies that when the sample size n is large enough, the proposed algorithm possesses the ascent property at any given x_0. In particular, at the grid points {u_1, …, u_N} where the unknown curves are evaluated, we have for large n, ℓ_n{θ^{(l+1)}(u_j)} − ℓ_n{θ^{(l)}(u_j)} ≥ 0.
3. Simulation and Application

In this section, we address practical implementation issues, such as a standard error formula and bandwidth selection, for nonparametric mixture of regression models. To assess the performance of the estimates of the unknown mean functions m_c(x), we consider the square root of the average squared errors (RASE),

    RASE²_m = N^{−1} Σ_{c=1}^C Σ_{j=1}^N {m̂_c(u_j) − m_c(u_j)}²,

where {u_j, j = 1, …, N} are the grid points at which the unknown functions m_c(·) are evaluated. For simplicity, the grid points are taken evenly over the range of the x-variable. RASEs for the variance functions σ²_c(x) and the proportion functions π_c(x) are defined similarly and denoted by RASE_σ and RASE_π, respectively.

3.1. Standard Error Formula

Define the fitted value for the i-th observation as a weighted sum of the estimated means,

    ŷ_i = Σ_{c=1}^C r_{ic} m̂_c(x_i),

where r_{ic} is given by (2.3) at convergence of the effective EM algorithm. The residual is e_i = y_i − ŷ_i. Rewrite the estimate of m_c(x) in the proposed algorithm as m̂_c(x) = (E^T W_c E)^{−1} E^T W_c y, where E is an n × 1 vector of ones, and W_c = diag{w_{1c}, …, w_{nc}} with w_{ic} = r_{ic} K_h(x_i − x). We consider the following approximate standard error formula for m̂_c(x):

    Var{m̂_c(x)} = (E^T W_c E)^{−1} E^T W_c Ĉov(y) W_c E (E^T W_c E)^{−1},  (3.1)

where Ĉov(y) = diag{e²_1, e²_2, …, e²_n}, a diagonal matrix of the squared residuals. Furthermore, (3.1) can be written as

    Var{m̂_c(x)} = Σ_{i=1}^n w²_{ic} e²_i / (Σ_{i=1}^n w_{ic})².  (3.2)
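Formula (3.2) is just a ratio of kernel-weighted sums and is straightforward to compute once the converged responsibilities and fitted values are available. A minimal sketch, with the kernel passed in as a function and all numeric inputs purely illustrative:

```python
import numpy as np

def mhat_se(x, y, yhat, r_c, x0, h, kern):
    """Approximate standard error (3.2) of the c-th estimated mean at x0.

    r_c  : responsibilities r_ic for component c at EM convergence
    yhat : fitted values yhat_i = sum_c r_ic * mhat_c(x_i)"""
    e = y - yhat                              # residuals e_i = y_i - yhat_i
    w = r_c * kern((x - x0) / h) / h          # w_ic = r_ic K_h(x_i - x0)
    return np.sqrt(np.sum(w ** 2 * e ** 2) / np.sum(w) ** 2)

# Illustrative inputs (Gaussian kernel for the demonstration):
gauss = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
x = np.array([0.1, 0.2, 0.3, 0.4])
y = np.array([1.0, 1.2, 0.9, 1.1])
yhat = np.array([1.05, 1.1, 1.0, 1.05])
r_c = np.array([0.9, 0.8, 0.95, 0.85])
se1 = mhat_se(x, y, yhat, r_c, 0.25, 0.2, gauss)
```

One sanity check on the sandwich form: doubling every residual doubles the resulting standard error, since the weights are unchanged.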
3.2. Bandwidth Selection

Bandwidth selection is fundamental to nonparametric smoothing. In this section we propose a multifold cross-validation (CV) method to choose the bandwidth. Denote by D the full data set, and partition D into a training set R_j and a testing set T_j, with D = T_j ∪ R_j, j = 1, …, J. We use the training set R_j to obtain the estimates {m̂_c(·), σ̂²_c(·), π̂_c(·)}, and then estimate m_c(x), σ²_c(x) and π_c(x) for the data points in the corresponding testing set. For (x_l, y_l) ∈ T_j,

    m̂_c(x_l) = Σ_{i: x_i ∈ R_j} r_{ic} K_h(x_i − x_l) y_i / Σ_{i: x_i ∈ R_j} r_{ic} K_h(x_i − x_l),

    σ̂²_c(x_l) = Σ_{i: x_i ∈ R_j} r_{ic} K_h(x_i − x_l) {y_i − m̂_c(x_i)}² / Σ_{i: x_i ∈ R_j} r_{ic} K_h(x_i − x_l),

    π̂_c(x_l) = Σ_{i: x_i ∈ R_j} r_{ic} K_h(x_i − x_l) / Σ_{i: x_i ∈ R_j} K_h(x_i − x_l).

Based on the estimates m̂_c(x_l) in the test set T_j, we then calculate the probability memberships in T_j: for (x_l, y_l) ∈ T_j and c = 1, …, C,

    r_{lc} = π̂_c(x_l) φ{y_l | m̂_c(x_l), σ̂²_c(x_l)} / Σ_{q=1}^C π̂_q(x_l) φ{y_l | m̂_q(x_l), σ̂²_q(x_l)}.

Now we can implement the regular CV criterion in the proposed model,

    CV = Σ_{j=1}^J Σ_{l ∈ T_j} (y_l − ŷ_l)²,  (3.3)

where ŷ_l = Σ_{c=1}^C r_{lc} m̂_c(x_l) is the predicted value of y_l in the test set T_j.

3.3. Simulation Study

In the following example, we conduct simulations for a 2-component nonparametric mixture of regressions model with

    π_1(x) = exp(0.5x)/{1 + exp(0.5x)},  π_2(x) = 1 − π_1(x),
    m_1(x) = 4 − sin(2πx),  m_2(x) = cos(3πx),
    σ_1(x) = 0.6 exp(0.5x),  σ_2(x) = 0.5 exp(−0.2x).

For each of the sample sizes n = 200, 400, 800, 500 simulations were conducted. The predictor x is generated from the uniform distribution on [0, 1]. The Epanechnikov kernel is used in our simulations.
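The multifold CV criterion (3.3) of section 3.2 can be sketched as below. The `fit` and `predict` hooks are hypothetical stand-ins for the EM estimation of section 2.3 and the mixture-mean predictor ŷ_l; to keep the illustration self-contained, the usage plugs in the special case C = 1, in which the fit reduces to plain kernel (Nadaraya-Watson) regression.

```python
import numpy as np

def cv_bandwidth(x, y, bandwidths, fit, predict, n_folds=5):
    """Multifold CV criterion (3.3): for each candidate bandwidth h, sum
    squared prediction errors over held-out folds; return the minimizer."""
    folds = np.arange(len(x)) % n_folds              # deterministic fold labels
    cv = []
    for h in bandwidths:
        err = 0.0
        for j in range(n_folds):
            test = folds == j
            model = fit(x[~test], y[~test], h)       # train on R_j
            yhat = predict(model, x[test])           # predict on T_j
            err += np.sum((y[test] - yhat) ** 2)     # inner sum of (3.3)
        cv.append(err)
    return bandwidths[int(np.argmin(cv))]

# C = 1 stand-ins: Nadaraya-Watson regression with a Gaussian kernel.
gauss = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def fit(x_train, y_train, h):
    return x_train, y_train, h

def predict(model, x_test):
    x_train, y_train, h = model
    K = gauss((x_test[:, None] - x_train[None, :]) / h)
    return (K * y_train).sum(axis=1) / K.sum(axis=1)
```

On a smooth noiseless curve the criterion favors the smaller of two widely separated candidate bandwidths, since the larger one oversmooths.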
Table 3.1: RASEs: Mean and Standard Deviation
(columns: n, h, and Mean(Std) of RASE_m, RASE_σ², RASE_π; one row per (n, h) combination)
(0.0964) (0.0918) (0.0393)
(0.0682) (0.0706) (0.0325)
(0.0992) (0.1431) (0.0237)
(0.0723) (0.0622) ( )
(0.0659) (0.0517) (0.0276)
(0.0683) (0.0552) (0.0190)
(0.0562) (0.0465) (0.0312)
(0.0440) (0.0476) (0.0218)
(0.0368) (0.0395) (0.0149)

Table 3.2: Standard error of the unknown mean functions
(columns: n, x, then SD and SE(Std) for m_1(x) and for m_2(x))
(h=0.08) (0.0371) (0.0315) (0.0468) (0.0338)
(h=0.06) (0.0247) (0.0222) (0.0290) (0.0256)
(h=0.04) (0.0190) (0.0164) (0.0210) (0.0154)

It is well known that the EM algorithm may be trapped at a local maximizer and hence may be sensitive to the initial value. To obtain a good initial value, we first fit a mixture of polynomial regression models and obtain estimates of the mean functions m_c(x) and of the parameters σ²_c and π_c. We then set the initial values m_c^{(1)}(x), σ_c^{2(1)}(x) and π_c^{(1)}(x) to these estimates. To demonstrate that the proposed procedure works well for a wide range of bandwidths, and to save computing time, we use the following strategy rather than running the cross-validation bandwidth selector on every generated sample in our simulation study: for a given sample size, we first generate several simulated data sets and use the CV bandwidth selector to choose a bandwidth for each data set. This provides an idea of the
[Figure 3.2 about here.]

Figure 3.2: (a) A typical sample of simulated data (n = 400) and the plot of the true mean functions; (b) the estimated mean functions (solid lines), true mean functions (dashed lines), and 95% pointwise confidence intervals (dotted lines), with n = 400.

optimal bandwidth for a given sample size. We then consider three different bandwidths: half of the selected bandwidth, the selected bandwidth, and twice the selected bandwidth, corresponding to under-smoothing, appropriate smoothing and over-smoothing, respectively. Table 3.1 displays the mean and standard deviation of the RASEs over the 500 simulations. From Table 3.1, the proposed procedure performs well for a wide range of bandwidths.

We next test the accuracy of the standard error formula. Table 3.2 summarizes the simulation results for the unknown mean functions m_c(x) at two fixed points, one of which is x = 0.30. The standard deviation of the 500 estimates, denoted by SD, can be viewed as the true standard error. We then calculate the sample average and standard deviation of the 500 estimated standard errors from formula (3.1), denoted by SE(Std). Although some underestimation is present, the proposed sandwich formula works reasonably well: the difference between the true value and the estimate is less than twice the standard error of the estimate.

We now illustrate the performance of the proposed procedure using a typical simulated sample with n = 400, selected as the sample with the median RASE_m among the 500 simulation samples. Figure 3.2(a) shows the
scatter plot of the typical sample data and the true mean functions.

Before analyzing this data set with nonparametric mixture of regression models, we first briefly discuss how to determine the number of components C. Note that conditional on X = x, the nonparametric mixture of regression model (2.1) is a univariate mixture of normals. Over a local neighborhood of x, m_c(x) may be considered approximately constant; thus, to choose C for nonparametric mixture of regression models, we suggest an analysis based on modeling the response variable y of the local data as a univariate mixture of normals. An illustration is given in Figure 3.2(a): we select a local region with relatively large variance in the y direction, shown as the rectangle. Traditional methods can then be used to determine C. We fit the local data using univariate mixtures of normals with one, two, three, and four components; BIC selects the two-component model (the BIC values for one, two, and three components are 83.29, 64.19, and 68.43).

We use the cross-validation (CV) criterion proposed in section 3.2 to select a bandwidth. The resulting estimate with n = 400, along with its pointwise confidence intervals, is depicted in Figure 3.2(b), from which we can see that the true mean functions lie within the confidence intervals. This suggests that the proposed estimation procedure performs well at this moderate sample size.

3.4. An Application

We illustrate the proposed model and estimation procedure with an analysis of a real data set containing the monthly change of the S&P/Case-Shiller House Price Index (HPI) and the monthly growth rate of the United States Gross Domestic Product (GDP), starting in January 1990. In the economics literature, HPI is a measure of a nation's housing cost and GDP is a measure of the size of a nation's economy. It is of interest to investigate the impact of the GDP growth rate on the HPI change.
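The BIC-based local choice of C described in section 3.3 above, i.e. fitting univariate mixtures of normals to the local responses and comparing BIC values, can be sketched as follows. The EM fit with a deterministic quantile initialization and the small variance floor are illustrative implementation details, not prescribed by the paper.

```python
import numpy as np

def gmm_bic(y, C, n_iter=300):
    """Fit a C-component univariate mixture of normals by EM and return
    BIC = -2*loglik + (3C - 1)*log(n), matching the 3C - 1 free
    parameters (C means, C variances, C - 1 proportions)."""
    n = len(y)
    mu = np.quantile(y, (np.arange(C) + 0.5) / C)    # initial means (quantiles)
    s2 = np.full(C, np.var(y) + 1e-8)                # initial variances
    pi = np.full(C, 1.0 / C)                         # initial proportions
    for _ in range(n_iter):
        dens = pi[:, None] * np.exp(-(y - mu[:, None]) ** 2 / (2.0 * s2[:, None])) \
               / np.sqrt(2.0 * np.pi * s2[:, None])  # (C, n) weighted densities
        r = dens / dens.sum(axis=0, keepdims=True)   # responsibilities
        nk = r.sum(axis=1)
        pi, mu = nk / n, (r * y).sum(axis=1) / nk
        s2 = (r * (y - mu[:, None]) ** 2).sum(axis=1) / nk + 1e-8  # variance floor
    dens = pi[:, None] * np.exp(-(y - mu[:, None]) ** 2 / (2.0 * s2[:, None])) \
           / np.sqrt(2.0 * np.pi * s2[:, None])
    loglik = np.sum(np.log(dens.sum(axis=0)))
    return -2.0 * loglik + (3 * C - 1) * np.log(n)
```

Applied to local data with two clearly separated clusters, the two-component fit attains a smaller BIC than the one-component fit, mirroring the selection made in the text.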
As expected, the impact of the GDP growth rate on HPI may show different patterns in different macroeconomic cycles. In this illustration, we take the HPI change as the response variable and the GDP growth rate as the predictor. The scatterplot of this data set is depicted in Figure 3.3(a). We first determine the number of components C using the local data within the rectangle in Figure 3.3(a). The BIC values for univariate mixture of normals models
with one, two and three components are 13.46, 2.34 and 2.57, respectively; thus, a two-component model is selected. We next select the bandwidth by the 5-fold CV selector described in (3.3). With the selected bandwidth we fit the data with a 2-component nonparametric mixture of regression model. The estimated mean functions are shown in Figure 3.3(b). We further depict the hard-clustering result, obtained by assigning each observation the component identity with the largest r_{ic}, c = 1, …, C. The triangular points in the lower cluster are mainly from January 1990 to September 1997. The circle points in the upper cluster are mainly from October 1997 to December 2002, during which the economy experienced an internet boom and the following bust. From the result, we observe that in the upper component the HPI change tends to be lower when GDP growth is modest. The estimated mixing proportion function is plotted in Figure 3.3(c), and the estimated variance functions are plotted in Figure 3.3(d).

We compare the proposed nonparametric finite mixture of regression models with two alternatives: a local linear regression model and a 2-component mixture of linear regression models. The bandwidth for local linear regression is selected by a plug-in method (Ruppert et al., 1995), and the mixture of linear regression models is estimated by an EM algorithm. We randomly select 5% of the data as a test set and use the rest as a training set. Estimates for the three models are obtained from the training set, and prediction errors are computed from the deviations on the test set. This procedure is repeated 100 times, and the average prediction errors of the three methods are compared.
The results show that the proposed model achieves a 7.26% reduction in average prediction error relative to the mixture of linear regression models and a 57.43% reduction relative to local linear regression. The improved prediction demonstrates the advantages of the nonparametric finite mixture of regression models in the analysis of the US house price index data.

4. Discussion

In this paper, we proposed a class of nonparametric finite mixture of regression models in which the mixing proportions, the mean functions and the variance functions are all nonparametric functions of the covariate. We proposed
[Figure 3.3 about here.]

Figure 3.3: (a) Scatterplot of the US house price index data (HPI change versus GDP growth rate); (b) estimated mean functions with 95% confidence intervals and the hard-clustering result; (c) estimated mixing proportion function of the upper component; (d) estimated variance functions of the upper and lower components.
an estimation procedure for the newly proposed model by kernel regression, and further studied the sampling properties of the resulting estimate: we derived its asymptotic bias and asymptotic variance, and further established its asymptotic normality. We proposed an effective EM algorithm to compute the local maximum likelihood estimate, and studied the ascent property of the proposed algorithm. We found that for large samples, the effective EM algorithm still possesses the monotone ascent property. The performance of the proposed procedures was examined empirically by a Monte Carlo simulation and illustrated by an analysis of a real data example.

In this paper, we focused on estimation of the newly proposed nonparametric finite mixture of regression models when x_0 is an interior point of the range of the covariate. It is certainly of interest to study the boundary effects of the proposed procedure; boundary effects have been studied in Cheng, Fan and Marron (1997) for the nonparametric regression model. It is also of interest to test whether the mixing proportions are constant, whether some of the mean functions are constant or of specific parametric forms, and whether the variance functions are parametric. These are topics for future research.

5. Proofs

In this section we outline the key steps of the proofs of Theorems 1 and 2. Note that θ = (π^T, σ^{2T}, m^T)^T is a (3C−1) × 1 vector. Whenever necessary, we rewrite θ = (θ_1, …, θ_{3C−1})^T without changing the order of π, σ² and m. Otherwise, we use the same notation as defined in section 2.

Regularity Conditions

A. The sample {(x_i, y_i), i = 1, …, n} is independent and identically distributed from its population (X, Y). The support of X, denoted by 𝒳, is a closed and bounded subset of R¹. The density η{y | θ(x_0)} is identifiable up to a permutation of the mixture components.

B. The marginal density function f(x) has a continuous first derivative and is positive for x ∈ 𝒳.

C.
There exists a function M(x, y) with E{M(X, Y)} < ∞ such that |∂³ℓ(θ, y)/∂θ_j∂θ_k∂θ_l| ≤ M(x, y) for all x, y, and all θ in a neighborhood of θ(x_0).

D. The unknown function θ(x) has a continuous second derivative.
E. The kernel function K(·) has bounded support and satisfies

    ∫ K(t) dt = 1,  ∫ t K(t) dt = 0,  ∫ t² K(t) dt = κ_2 > 0.

F. The following conditions hold for all i and j:

    E(|∂ℓ{θ(x_0), Y}/∂θ_j|³) < ∞,  E[{∂²ℓ{θ(x_0), Y}/∂θ_i∂θ_j}²] < ∞.

All of these are mild conditions that have been used in the literature on local likelihood estimation and mixture models. The following lemma is used in the proof of Theorem 1.

Lemma 1. Under Conditions A, C, and F, for x_0 in the interior of 𝒳, it holds that

    E[q_1{θ(x_0), Y} | X = x_0] = 0,  (5.1)

    E[q_2{θ(x_0), Y} | X = x_0] = −E[q_1{θ(x_0), Y} q_1{θ(x_0), Y}^T | X = x_0].  (5.2)

Proof. Conditional on X = x_0, Y follows a finite mixture of normals, so (5.1) holds by direct calculation. Furthermore, (5.2) follows from regularity conditions C and F together with the arguments on page 39 of McLachlan and Peel (2000). This completes the proof of the lemma.

We refer to (5.1) and (5.2) as the local Bartlett first and second identities, respectively. Identity (5.1) implies that Λ(x_0) = 0.

Proof of Theorem 1. Denote

    m*_c = √(nh){m_{c0} − m_c(x_0)},  σ²*_c = √(nh){σ²_{c0} − σ²_c(x_0)},  π*_c = √(nh){π_{c0} − π_c(x_0)},
    π*_C = √(nh){π_{C0} − π_C(x_0)} = −√(nh) Σ_{c=1}^{C−1} {π_{c0} − π_c(x_0)}.

Let m* = (m*_1, …, m*_C)^T, σ²* = (σ²*_1, …, σ²*_C)^T, and π* = (π*_1, …, π*_{C−1})^T, and denote θ* = (π*^T, σ²*^T, m*^T)^T. Recall that

    ℓ{θ(x_0), y} = log η{y | θ(x_0)} = log[ Σ_{c=1}^C π_c(x_0) φ{y | m_c(x_0), σ²_c(x_0)} ].
Let

    ℓ{θ(x_0) + γ_n θ*, y} = log[ Σ_{c=1}^C {π_c(x_0) + γ_n π*_c} φ{y | m_c(x_0) + γ_n m*_c, σ²_c(x_0) + γ_n σ²*_c} ].

Thus, if {π̂_0, σ̂²_0, m̂_0} maximizes (2.2), then θ̂* maximizes

    ℓ*_n(θ*) = h Σ_{i=1}^n [ ℓ{θ(x_0) + γ_n θ*, y_i} − ℓ{θ(x_0), y_i} ] K_h(x_i − x_0).  (5.3)

By a Taylor expansion,

    ℓ*_n(θ*) = Δ_n^T θ* − ½ θ*^T Γ_n θ* + (h γ_n³ / 6) R{θ(x_0), ξ̂},  (5.4)

where ξ̂ is a vector between 0 and γ_n θ*, and

    Δ_n = γ_n h Σ_{i=1}^n q_1{θ(x_0), y_i} K_h(x_i − x_0),
    Γ_n = −n^{−1} Σ_{i=1}^n q_2{θ(x_0), y_i} K_h(x_i − x_0),
    R{θ(x_0), ξ̂} = Σ_{j,k,l} Σ_{i=1}^n [∂³ℓ{θ(x_0) + ξ̂, y_i}/∂θ_j∂θ_k∂θ_l] K_h(x_i − x_0).

Denote by Γ_n(i, j) the (i, j) element of Γ_n. By condition E, it can be shown that

    E Γ_n(i, j) = −∫∫ [∂²ℓ{θ(x_0), y}/∂θ_i∂θ_j] η{y | θ(x)} f(x) K_h(x − x_0) dx dy
                = −f(x_0) ∫ [∂²ℓ{θ(x_0), y}/∂θ_i∂θ_j] η{y | θ(x_0)} dy + o(1).

Therefore, E Γ_n = f(x_0) I(x_0) + o(1). The variance Var{Γ_n(i, j)} is dominated by the term

    n^{−1} ∫∫ [∂²ℓ{θ(x_0), y}/∂θ_i∂θ_j]² η{y | θ(x)} f(x) K²_h(x − x_0) dx dy,

which can be shown to be of order O{(nh)^{−1}} under condition F. Therefore, we have Γ_n = f(x_0) I(x_0) + o_p(1).
By Condition C, the expectation of the absolute value of the last term of (5.4) is bounded by

    O( γ_n E[ max_{j,k,l} |∂³ℓ{θ(x_0) + ξ̂, Y}/∂θ_j∂θ_k∂θ_l| K_h(x_i − x_0) ] ) = O(γ_n).  (5.5)

Thus, the last term of (5.4) is of order O_p(γ_n). Therefore, we have

    ℓ*_n(θ*) = Δ_n^T θ* − ½ f(x_0) θ*^T I(x_0) θ* + o_p(1).  (5.6)

Using the quadratic approximation lemma (see, e.g., p. 210 of Fan and Gijbels, 1996), we have

    θ̂* = f(x_0)^{−1} I(x_0)^{−1} Δ_n + o_p(1).  (5.7)

To establish asymptotic normality, it remains to calculate the mean and variance of Δ_n and verify the Lyapounov condition. Note that

    E(Δ_n) = √(nh) ∫∫ q_1{θ(x_0), y} η{y | θ(x)} f(x) K_h(x − x_0) dx dy = √(nh) ∫ Λ(x) f(x) K_h(x − x_0) dx.

Under Conditions C, D and F, Λ(x) has a continuous second derivative. Thus, using the fact that Λ(x_0) = 0 by Lemma 1 and standard arguments in kernel regression, it follows that

    E(Δ_n) = √(nh) f(x_0) { f'(x_0)Λ'(x_0)/f(x_0) + ½ Λ''(x_0) } κ_2 h² + o(√(nh) h²).

For the covariance of Δ_n, we have

    Cov(Δ_n) = h E[ q_1{θ(x_0), Y} q_1{θ(x_0), Y}^T K²_h(X − x_0) ] + o(1),

whose (i, j) element is

    h ∫∫ [∂ℓ{θ(x_0), y}/∂θ_i][∂ℓ{θ(x_0), y}/∂θ_j] K²_h(x − x_0) f(x) η{y | θ(x)} dx dy
    → f(x_0) ν_0 ∫ [∂ℓ{θ(x_0), y}/∂θ_i][∂ℓ{θ(x_0), y}/∂θ_j] η{y | θ(x_0)} dy
    = −f(x_0) ν_0 ∫ [∂²ℓ{θ(x_0), y}/∂θ_i∂θ_j] η{y | θ(x_0)} dy.
The last step holds due to (5.2). Thus, $\mathrm{Cov}(\Delta_n) = f(x_0)\, I(x_0)\, \nu_0 + o(1)$.

To establish asymptotic normality for $\Delta_n$, it is necessary to show that, for any unit vector $d$,
\[
\{ d^T \mathrm{Cov}(\Delta_n)\, d \}^{-1/2}\, d^T \{ \Delta_n - E(\Delta_n) \} \xrightarrow{\ D\ } N(0, 1). \tag{5.8}
\]
Since $\mathrm{Cov}(\Delta_n) = O(1)$, it follows that $d^T \mathrm{Cov}(\Delta_n)\, d = O(1)$. Let $\lambda_i = d^T q_1\{\theta(x_0), y_i\}\, K_h(x_i - x_0)$; then $d^T \Delta_n = h\gamma_n \sum_{i=1}^{n} \lambda_i$. It is therefore sufficient to show that
\[
n h^3 \gamma_n^3\, E(|\lambda_i|^3) = o(1).
\]
By Condition F and arguments similar to (5.5), it can be shown that $n h^3 \gamma_n^3 E(|\lambda_i|^3) = O(\gamma_n) = o(1)$, and thus Lyapounov's condition holds for (5.8). By (5.7) and the Slutsky theorem, we have
\[
\sqrt{nh}\, \{ \gamma_n \hat{\theta}^* - B(x_0) + o_p(h^2) \} \xrightarrow{\ D\ } N\{ 0,\ f^{-1}(x_0)\, \nu_0\, I^{-1}(x_0) \}. \tag{5.9}
\]

Proof of Theorem 2. We assume the unobserved component labels $C_i$, $i = 1, \dots, n$, are a random sample from the population $C$, so that the complete data $\{(x_i, y_i, C_i),\ i = 1, \dots, n\}$ are a random sample from $(X, Y, C)$. Let $h\{y, c \mid \theta(x)\}$ be the joint density of $(Y, C)$ given $X = x$, and let $f(x)$ be the marginal density of $X$. Conditional on $X = x$, $Y$ has density $\eta\{y \mid \theta(x)\}$. Then the conditional density of $C$ given $Y = y$ is
\[
g\{c \mid y, \theta(x)\} = \frac{h\{y, c \mid \theta(x)\}}{\eta\{y \mid \theta(x)\}}. \tag{5.10}
\]
The local log-likelihood function (2.2) can be rewritten as
\[
l_n(\theta_0) = \sum_{i=1}^{n} \log\{\eta(y_i \mid \theta_0)\}\, K_h(x_i - x_0). \tag{5.11}
\]
For fixed $\hat{\theta}^{(l)}(x_i)$, $i = 1, \dots, n$, it is clear that $\int g\{c \mid y_i, \hat{\theta}^{(l)}(x_i)\}\, dc = 1$. The local likelihood (5.11) can then be written as
\[
l_n(\theta) = \sum_{i=1}^{n} \log\{\eta(y_i \mid \theta)\} \Big[ \int g\{c \mid y_i, \hat{\theta}^{(l)}(x_i)\}\, dc \Big] K_h(x_i - x_0)
= \sum_{i=1}^{n} \Big[ \int \log\{\eta(y_i \mid \theta)\}\, g\{c \mid y_i, \hat{\theta}^{(l)}(x_i)\}\, dc \Big] K_h(x_i - x_0). \tag{5.12}
\]
By (5.10), we also have
\[
\log\{\eta(y_i \mid \theta)\} = \log\{h(y_i, c \mid \theta)\} - \log\{g(c \mid y_i, \theta)\}. \tag{5.13}
\]
Thus, we have
\[
l_n(\theta) = \sum_{i=1}^{n} E\big[ \log\{h(y_i, C \mid \theta)\} \,\big|\, \theta^{(l)}(x_i) \big] K_h(x_i - x_0)
- \sum_{i=1}^{n} E\big[ \log\{g(C \mid y_i, \theta)\} \,\big|\, \theta^{(l)}(x_i) \big] K_h(x_i - x_0), \tag{5.14}
\]
where $\theta^{(l)}(x_i)$ is the $l$-th step local estimate at $x_i$, and the expectation is taken with respect to $g\{c \mid y_i, \theta^{(l)}(x_i)\}$; taking this expectation is equivalent to computing (2.3). In the M-step, we choose $\theta^{(l+1)}(x_0)$ such that
\[
\frac{1}{n} \sum_{i=1}^{n} E\big[ \log h\{y_i, C \mid \theta^{(l+1)}(x_0)\} \,\big|\, \theta^{(l)}(x_i) \big] K_h(x_i - x_0)
\ \geq\ \frac{1}{n} \sum_{i=1}^{n} E\big[ \log h\{y_i, C \mid \theta^{(l)}(x_0)\} \,\big|\, \theta^{(l)}(x_i) \big] K_h(x_i - x_0).
\]
It suffices to show that
\[
\limsup_{n \to \infty}\ \frac{1}{n} \sum_{i=1}^{n} E\Big[ \log \frac{g\{C \mid y_i, \theta^{(l+1)}(x_0)\}}{g\{C \mid y_i, \theta^{(l)}(x_0)\}} \,\Big|\, \theta^{(l)}(x_i) \Big] K_h(x_i - x_0) \leq 0 \tag{5.15}
\]
in probability. Denote
\[
L_g = \frac{1}{n} \sum_{i=1}^{n} E\Big[ \log \frac{g\{C \mid y_i, \theta^{(l+1)}(x_0)\}}{g\{C \mid y_i, \theta^{(l)}(x_0)\}} \,\Big|\, \theta^{(l)}(x_i) \Big] K_h(x_i - x_0)
\]
and
\[
L_J = \frac{1}{n} \sum_{i=1}^{n} \log\Big( E\Big[ \frac{g\{C \mid y_i, \theta^{(l+1)}(x_0)\}}{g\{C \mid y_i, \theta^{(l)}(x_0)\}} \,\Big|\, \theta^{(l)}(x_i) \Big] \Big) K_h(x_i - x_0).
\]
By Jensen's inequality, $L_g \leq L_J$. Next we show that $L_J \to 0$ in probability. To this end, we first calculate the expectation of $L_J$:
\[
E(L_J) = E\Big( \log\Big[ \int_{\mathcal{C}} \frac{g\{c \mid Y, \theta^{(l+1)}(x_0)\}}{g\{c \mid Y, \theta^{(l)}(x_0)\}}\, g\{c \mid Y, \theta^{(l)}(X)\}\, dc \Big] K_h(X - x_0) \Big),
\]
which tends to 0 by a standard argument. We next calculate the variance of $L_J$, which is dominated by the term
\[
\frac{1}{n}\, E\Big( \Big[ \log \int_{\mathcal{C}} \frac{g\{c \mid Y, \theta^{(l+1)}(x_0)\}}{g\{c \mid Y, \theta^{(l)}(x_0)\}}\, g\{c \mid Y, \theta^{(l)}(X)\}\, dc \Big]^2 K_h^2(X - x_0) \Big),
\]
which can be shown to be of order $O\{(nh)^{-1}\}$. Then we have $L_J = o_p(1)$ by the Chebyshev inequality. This completes the proof.

Acknowledgments

Huang's research was supported by National Science Foundation grant DMS and National Institute on Drug Abuse grant P50 DA10075 as a research assistant during his graduate study at the Pennsylvania State University. Li's research was supported by National Institute on Drug Abuse grant R21 DA. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.

References

Cheng, M. Y., Fan, J. and Marron, J. S. (1997). On Automatic Boundary Corrections. Annals of Statistics 25.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.

Fan, J. (1993). Local Linear Regression Smoothers and Their Minimax Efficiencies. Annals of Statistics 21.

Fan, J. and Gijbels, I. (1992). Variable Bandwidth and Local Linear Regression Smoothers. Annals of Statistics 20.

Fan, J. and Yao, Q. (1998). Efficient Estimation of Conditional Variance Functions in Stochastic Regression. Biometrika 85.

Frühwirth-Schnatter, S. (2001). Markov Chain Monte Carlo Estimation of Classical and Dynamic Switching and Mixture Models. Journal of the American Statistical Association 96.

Frühwirth-Schnatter, S. (2005). Finite Mixture and Markov Switching Models. Springer.
Green, P. J. and Richardson, S. (2002). Hidden Markov Models and Disease Mapping. Journal of the American Statistical Association 97.

Goldfeld, S. M. and Quandt, R. E. (1976). A Markov Model for Switching Regression. Journal of Econometrics 1.

Hathaway, R. J. (1985). A Constrained Formulation of Maximum-Likelihood Estimation for Normal Mixture Distributions. Annals of Statistics 13.

Hennig, C. (2000). Identifiability of Models for Clusterwise Linear Regression. Journal of Classification 17.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications. Institute of Mathematical Statistics, Hayward.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.

Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An Effective Bandwidth Selector for Local Least Squares Regression. Journal of the American Statistical Association 90.

Rossi, P. E., Allenby, G. M. and McCulloch, R. (2005). Bayesian Statistics and Marketing. Wiley, Chichester.

Wang, P., Puterman, M. L., Cockburn, I. and Le, N. (1996). Mixed Poisson Regression Models with Covariate Dependent Rates. Biometrics 52.

Wedel, M. and DeSarbo, W. S. (1993). A Latent Class Binomial Logit Methodology for the Analysis of Paired Comparison Data. Decision Sciences 24.

Department of Statistics, Shanghai University of Finance and Economics, Shanghai, P. R. China; E-mail: mzh136@stat.psu.edu

Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, Pennsylvania, USA; E-mail: rli@stat.psu.edu
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationObtaining Critical Values for Test of Markov Regime Switching
University of California, Santa Barbara From the SelectedWorks of Douglas G. Steigerwald November 1, 01 Obtaining Critical Values for Test of Markov Regime Switching Douglas G Steigerwald, University of
More informationModel-based cluster analysis: a Defence. Gilles Celeux Inria Futurs
Model-based cluster analysis: a Defence Gilles Celeux Inria Futurs Model-based cluster analysis Model-based clustering (MBC) consists of assuming that the data come from a source with several subpopulations.
More informationTesting Statistical Hypotheses
E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions
More information