A Resampling Approach for Interval-Valued Data Regression


Jeongyoun Ahn, Muliang Peng, Cheolwoo Park
Department of Statistics, University of Georgia, Athens, GA 30602, USA

Yongho Jeon
Department of Applied Statistics, Yonsei University, Seoul, Korea

March 4, 2012

Abstract

We consider interval-valued data, which frequently appear with advanced technologies in current data collection processes. Interval-valued data refer to data that are observed as ranges instead of single values. In the last decade, several approaches to the regression analysis of interval-valued data have been introduced, but little work has been done on relevant statistical inferences concerning the regression model. In this paper, we propose a new approach to fitting a linear regression model to interval-valued data using a resampling idea. A key advantage is that it enables one to make inferences on the model, such as the overall model significance test and individual coefficient tests. We demonstrate the proposed approach using simulated and real data examples, and also compare its performance with those of existing methods.

Key words: Interval-valued data; Linear regression; Resampling; Statistical inference.

1 Introduction

Even with ever-advancing computing power, analyzing data sets of massive size is often not practical. As a consequence, aggregating information from such huge databases is routinely done. For example, a person's weekly expenditure during a certain time period can be aggregated to an interval with bounds of the lowest and highest weekly spending. More elaborate aggregation can be

made by constructing a histogram of a person's weekly expenditure. Another source of interval-valued data can be the innate nature of a variable; for example, a person's blood pressure is often measured as an interval of systolic and diastolic pressures. Interval-valued data belong to a more general class of data types called symbolic data (Diday, 2003; Billard and Diday, 2003, 2007). Typical examples of symbolic data include interval-, histogram-, and modal-valued data. These types of data are difficult to analyze with classical methods. A main challenge is how to account for internal variation or certain structure within an observation. There have been substantial efforts to extend classical methods to symbolic data for various analysis problems (Diday, 1995; Diday and Emilion, 1996, 1998). For more recent developments for different types of symbolic data analysis, see Billard (2011) and Noirhomme-Fraiture and Brito (2011).

This paper focuses on regression analysis with interval-valued data, the most frequently observed symbolic data type. There have been several linear regression approaches for interval-valued data. The first was proposed by Billard and Diday (2000). Their main idea is to build a classical linear regression model using the center points of the observed intervals and to use the fitted model for predicting the lower and upper bounds of the interval of the response variable. Billard and Diday (2002) instead fit individual models for the lower and upper bounds of the intervals and estimate the bounds of the response variable separately. However, neither of these approaches takes the internal variation of the observed intervals into account in the estimation of the model. In an attempt to resolve this problem, Lima Neto et al. (2004) propose the center and range method, which fits two different linear regression models for the center points and the ranges of the intervals, respectively.
They also make separate predictions for the center and the range of the response variable. Billard and Diday (2007) also attempt to account for the width of intervals by developing the bivariate center and range method. Their method constructs two regression models using both the center points and the ranges of intervals as the predictors. However, these approaches still have some limitations because they approximate the internal variation with the range only. Another issue sometimes encountered with these approaches is that the predicted lower bound can be bigger than the predicted upper bound when a slope estimate is negative. Lima Neto and de Carvalho (2010) suggest using constraints that guarantee coherence between the predicted values of the lower and upper bounds. Recently, Xu (2010) proposed a symbolic covariance method to address these two issues based on the symbolic sample

covariance suggested by Billard (2007, 2008). Other notable works on interval-valued regression analysis include Alfonso et al. (2004), who include taxonomy variables as predictors to account for hierarchical structures among symbolic variables, and Maia and Carvalho (2008), who develop a robust method via a least absolute deviation regression model.

These existing methods are useful for model estimation purposes, but only a few of them attempt to make a statistical inference on the estimated model. For example, Lima Neto et al. (2009) and Lima Neto et al. (2011) construct a bivariate generalized linear model based on a bivariate exponential family of distributions. Silva et al. (2011) propose a regression model using copulas, which allows statistical inference under the framework of joint distributions for bivariate random vectors.

In this work we propose a new regression approach for interval-valued data based on resampling. We propose to (i) generate a large number of samples by randomly selecting a single-valued point within each observed interval; (ii) fit a classical linear regression model to each single-valued sample; and (iii) calculate the mean regression coefficients over the fitted models, and then use them to produce a final linear regression model. The main contributions of the proposed method are two-fold: (i) it fully makes use of the variability of the interval-valued data; and (ii) more importantly, it enables one to make inferences regarding the regression coefficients because the sampling distributions of the estimated coefficients are obtained via Monte Carlo sampling. Even though we use linear regression to demonstrate the methodology, the proposed approach can be expanded further to build a unified framework for numerical-valued symbolic data such as interval-valued and histogram-valued data.
Also, the core idea of this work can be straightforwardly applied to other statistical problems such as discriminant analysis. Section 2 reviews some of the existing linear regression methods for interval-valued data. We describe the proposed method in Section 3. The proposed method is applied to both simulated and real examples in Sections 4 and 5, respectively, and is compared with the methods described in Section 2. The paper ends with conclusions in Section 6.

2 Regression with Interval-valued Data

Let $X_1, \ldots, X_p$ be $p$ explanatory variables and $Y$ be the response variable, all of whose realizations are intervals: $X_{ij} = [X_{Lij}, X_{Uij}]$ $(X_{Lij} \le X_{Uij})$ and $Y_i = [Y_{Li}, Y_{Ui}]$ $(Y_{Li} \le Y_{Ui})$, for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. We consider the following linear regression model:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad (1)$$
where $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$, $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_n)^T$, $\mathbf{X}_i = (1, X_{i1}, \ldots, X_{ip})^T$ for $i = 1, \ldots, n$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$, $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^T$, and each $\epsilon_i$ independently follows $N(0, \sigma^2)$.

Several linear regression approaches for interval-valued data have been proposed over recent years. Among them we discuss four existing methods in this paper for comparison purposes. Xu (2010) thoroughly illustrates the four methods, and we summarize them here using similar notation.

The first work on this problem is the center method (CM) proposed by Billard and Diday (2000). The CM fits a linear regression model on the midpoints of the interval values. Let $X^c_1, \ldots, X^c_p$ be the center points of the intervals of the variables $X_1, \ldots, X_p$, and $Y^c$ be the center point of $Y$. Then, CM converts the model (1) into a classical linear regression model,
$$\mathbf{Y}^c = \mathbf{X}^c\boldsymbol{\beta}^c + \boldsymbol{\epsilon}^c, \qquad (2)$$
where $\mathbf{Y}^c = (Y^c_1, \ldots, Y^c_n)^T$, $\mathbf{X}^c = (\mathbf{X}^c_1, \ldots, \mathbf{X}^c_n)^T$, $\mathbf{X}^c_i = (1, X^c_{i1}, \ldots, X^c_{ip})^T$ for $i = 1, \ldots, n$, $\boldsymbol{\beta}^c = (\beta^c_0, \beta^c_1, \ldots, \beta^c_p)^T$, and $\boldsymbol{\epsilon}^c = (\epsilon^c_1, \ldots, \epsilon^c_n)^T$. Then, $\boldsymbol{\beta}^c$ can be estimated by the usual least squares method. Suppose a new observation $(\mathbf{x}^0_L, \mathbf{x}^0_U)$ is obtained, where $\mathbf{x}^0_L = (1, x^0_{L1}, \ldots, x^0_{Lp})$ and $\mathbf{x}^0_U = (1, x^0_{U1}, \ldots, x^0_{Up})$. The prediction of the lower and upper bounds of $Y = [Y_L, Y_U]$ originally given by Billard and Diday (2000) is
$$\hat{Y}_L = \mathbf{x}^0_L \hat{\boldsymbol{\beta}}^c, \qquad \hat{Y}_U = \mathbf{x}^0_U \hat{\boldsymbol{\beta}}^c. \qquad (3)$$
Xu (2010) notices that the use of (3) could yield a lower bound higher than an upper bound. Hence he suggests the following modified prediction,
$$\hat{Y}_L = \min(\mathbf{x}^0_L \hat{\boldsymbol{\beta}}^c,\, \mathbf{x}^0_U \hat{\boldsymbol{\beta}}^c), \qquad \hat{Y}_U = \max(\mathbf{x}^0_L \hat{\boldsymbol{\beta}}^c,\, \mathbf{x}^0_U \hat{\boldsymbol{\beta}}^c), \qquad (4)$$
which will also be used for our proposed method.
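As a concrete illustration, the CM fit and the order-preserving prediction rule (4) can be sketched as follows. This is a minimal sketch, not the authors' code; the array layout (separate lower- and upper-bound matrices) and the function names are our own assumptions.

```python
# Minimal sketch of the center method (CM) with Xu's (2010) prediction rule (4).
import numpy as np

def fit_cm(X_low, X_up, y_low, y_up):
    """Least-squares fit on the interval midpoints, as in Billard and Diday (2000)."""
    Xc = (X_low + X_up) / 2.0                     # n x p matrix of center points
    yc = (y_low + y_up) / 2.0                     # response centers
    Xd = np.column_stack([np.ones(len(yc)), Xc])  # prepend intercept column
    beta_c, *_ = np.linalg.lstsq(Xd, yc, rcond=None)
    return beta_c

def predict_cm(beta_c, x0_low, x0_up):
    """Predict [Y_L, Y_U]; the min/max guards against bound switching, as in (4)."""
    lo = np.concatenate(([1.0], x0_low)) @ beta_c
    up = np.concatenate(([1.0], x0_up)) @ beta_c
    return min(lo, up), max(lo, up)
```

When a slope estimate is negative, the raw bound predictions can cross; the `min`/`max` in `predict_cm` is exactly the modification in (4).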

The CM is simple to implement and has some good empirical properties, as shown in Section 4, but as Lima Neto and de Carvalho (2008) point out, it does not utilize the information from the ranges of the intervals in the estimation step. In other words, it does not take the internal variation of the intervals into account. Furthermore, it is impossible to make statistical inferences on the regression coefficients because the sampling variances cannot be estimated from the procedure.

In order to resolve this problem, Lima Neto et al. (2004) propose the center and range method (CRM), which fits two separate linear regression models with the center points and the ranges of the intervals, respectively. That is, they consider the ranges of the intervals in the estimation and prediction, as well as the centers. For the center model, they use the same model given in (2). For the range model, let $X^r_1, \ldots, X^r_p$ be the $p$ ranges of the intervals of $X_1, \ldots, X_p$, and $Y^r$ be the range of the interval of $Y$. Also, let the observed values of $X^r_j$ be $X^r_{ij} = (X_{Uij} - X_{Lij})$ and the observed values of $Y^r$ be $Y^r_i = (Y_{Ui} - Y_{Li})$, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Then, the linear regression model on the ranges is given by
$$\mathbf{Y}^r = \mathbf{X}^r\boldsymbol{\beta}^r + \boldsymbol{\epsilon}^r, \qquad (5)$$
where $\mathbf{Y}^r = (Y^r_1, \ldots, Y^r_n)^T$, $\mathbf{X}^r = (\mathbf{X}^r_1, \ldots, \mathbf{X}^r_n)^T$, $\mathbf{X}^r_i = (1, X^r_{i1}, \ldots, X^r_{ip})^T$ for $i = 1, \ldots, n$, $\boldsymbol{\beta}^r = (\beta^r_0, \beta^r_1, \ldots, \beta^r_p)^T$, and $\boldsymbol{\epsilon}^r = (\epsilon^r_1, \ldots, \epsilon^r_n)^T$. Again, $\boldsymbol{\beta}^r$ can be estimated by the usual least squares estimation. Then, separate predictions are made for the center and the range of the response variable. Finally, the predicted interval $\hat{Y} = [\hat{Y}_L, \hat{Y}_U]$ is given by
$$\hat{Y}_L = \hat{Y}^c - \hat{Y}^r/2, \qquad \hat{Y}_U = \hat{Y}^c + \hat{Y}^r/2,$$
where $\hat{Y}^c$ and $\hat{Y}^r$ are the predicted values under the models (2) and (5), respectively. The CRM relies on the assumption of independence between midpoints and ranges, which may not hold in general.
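The two CRM fits and their recombination into an interval prediction can be sketched as follows, again a minimal sketch under the same assumed array layout as before.

```python
# Minimal sketch of the CRM: one least-squares fit for centers, one for ranges,
# then recombination of the two predictions into an interval.
import numpy as np

def fit_crm(X_low, X_up, y_low, y_up):
    n = len(y_low)
    Xc = np.column_stack([np.ones(n), (X_low + X_up) / 2.0])  # center model (2)
    Xr = np.column_stack([np.ones(n), X_up - X_low])          # range model (5)
    beta_c, *_ = np.linalg.lstsq(Xc, (y_low + y_up) / 2.0, rcond=None)
    beta_r, *_ = np.linalg.lstsq(Xr, y_up - y_low, rcond=None)
    return beta_c, beta_r

def predict_crm(beta_c, beta_r, x0_low, x0_up):
    yc_hat = np.concatenate(([1.0], (x0_low + x0_up) / 2.0)) @ beta_c
    yr_hat = np.concatenate(([1.0], x0_up - x0_low)) @ beta_r
    return yc_hat - yr_hat / 2.0, yc_hat + yr_hat / 2.0       # [Y_L, Y_U]
```

Note that nothing in the fit ties the two models together, which is the independence assumption between centers and ranges discussed above.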
Also, even though the variation in the intervals is explained through the ranges, it is not clear how it is transferred to the variation of the estimated regression coefficients.

Billard and Diday (2007) also attempt to account for the width of intervals by developing the bivariate center and range method (BCRM). As in CRM, their method constructs two separate regression models for predicting both the center points and the ranges of intervals. However, BCRM utilizes both the centers and the ranges as predictors in the two models. These two models can be expressed

as follows:
$$\mathbf{Y}^c = \mathbf{X}^{cr}\boldsymbol{\beta}^c + \boldsymbol{\epsilon}^c, \qquad \mathbf{Y}^r = \mathbf{X}^{cr}\boldsymbol{\beta}^r + \boldsymbol{\epsilon}^r,$$
where $\mathbf{X}^{cr} = (\mathbf{X}^{cr}_1, \ldots, \mathbf{X}^{cr}_n)^T$ and $\mathbf{X}^{cr}_i = (1, X^c_{i1}, \ldots, X^c_{ip}, X^r_{i1}, \ldots, X^r_{ip})^T$ for $i = 1, \ldots, n$. Note that the number of columns in the design matrix $\mathbf{X}^{cr}$ is $2p + 1$ for this model, while it is $p + 1$ for CRM. The estimation and prediction can be done similarly as in CRM. Since the BCRM has more predictors than CM and CRM, it is expected to have higher explanatory power for the current data set. However, as we will see in Section 4, its prediction accuracy for future data is not necessarily better than that of the other methods. Again, statistical inference for the estimated models is not considered.

The three approaches presented so far apply classical regression modeling by converting an interval-valued data set into a classical one. Xu (2010) proposes the symbolic covariance method (SCM), which uses the symbolic covariance matrix proposed by Billard (2007, 2008). He considers the following model with centered variables, for which the least squares estimator is given by
$$\mathbf{Y} - \bar{Y} = (\mathbf{X} - \bar{\mathbf{X}})\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \hat{\boldsymbol{\beta}} = \{(\mathbf{X} - \bar{\mathbf{X}})^T(\mathbf{X} - \bar{\mathbf{X}})\}^{-1}(\mathbf{X} - \bar{\mathbf{X}})^T(\mathbf{Y} - \bar{Y}) = S_{XX}^{-1} S_{XY}, \qquad (6)$$
where $S_{XX}$ is the symbolic sample variance-covariance matrix of the predictors and $S_{XY}$ is the vector of the symbolic sample covariances between $Y$ and the predictors. The symbolic sample covariance between interval-valued variables $X_j$ and $X_k$ is defined as follows (Billard, 2007, 2008):
$$\mathrm{Cov}(X_j, X_k) = (6n)^{-1} \sum_{i=1}^n \big[2(X_{Lij} - \bar{X}_j)(X_{Lik} - \bar{X}_k) + (X_{Lij} - \bar{X}_j)(X_{Uik} - \bar{X}_k) + (X_{Uij} - \bar{X}_j)(X_{Lik} - \bar{X}_k) + 2(X_{Uij} - \bar{X}_j)(X_{Uik} - \bar{X}_k)\big], \qquad (7)$$
where the symbolic sample mean of $X_j$ is defined as (Bertrand and Goupil, 2000)
$$\bar{X}_j = (2n)^{-1} \sum_{i=1}^n (X_{Lij} + X_{Uij}).$$

3 Proposed Approach

In this section, we propose a new method, called the Monte Carlo method (MCM), that utilizes a resampling idea (Good, 2006) to fit a linear regression model on interval-valued data. It not

only accounts for the internal variation of the intervals but also derives approximate sampling distributions of the estimated regression coefficients via Monte Carlo resampling.

Note that an interval-valued data set with $p$ predictors and one response variable can be represented by $n$ hypercubes in $(p+1)$-dimensional space. The MCM procedure is implemented as follows. For $b = 1, \ldots, B$:

1. Generate a vector-valued multivariate random sample $S_b$ of size $n$ in $\mathbb{R}^{p+1}$, in which each vector is randomly generated from a multivariate uniform distribution on each of the $n$ hypercubes. In other words, using the uniform distribution, randomly generate a single-valued data point $X^b_{ij}$ from the interval of the $j$th predictor $X_{ij} = [a_{ij}, b_{ij}]$ and $Y^b_i$ from the interval of the response variable $Y_i = [c_i, d_i]$, respectively, for $i = 1, \ldots, n$ and $j = 1, \ldots, p$, and obtain a random vector $(Y^b_i, X^b_{i1}, \ldots, X^b_{ip})^T$.

2. Construct the $b$th random single-valued regression sample, $\mathbf{Y}^b = (Y^b_1, \ldots, Y^b_n)^T$ and $\mathbf{X}^b_j = (X^b_{1j}, \ldots, X^b_{nj})^T$ for $j = 1, \ldots, p$.

3. Obtain the $b$th regression coefficient estimates with the regression sample in Step 2 via least squares estimation:
$$\hat{\boldsymbol{\beta}}^b = \{(\mathbf{X}^b)^T\mathbf{X}^b\}^{-1}(\mathbf{X}^b)^T\mathbf{Y}^b,$$
where $\mathbf{X}^b$ is the design matrix with rows $(1, X^b_{i1}, \ldots, X^b_{ip})$ and $\hat{\boldsymbol{\beta}}^b = (\hat{\beta}^b_0, \hat{\beta}^b_1, \ldots, \hat{\beta}^b_p)^T$.

The final estimated model for the interval-valued data is obtained by averaging the $B$ estimates,
$$\hat{Y} = \bar{\beta}_0 + \bar{\beta}_1 X_1 + \cdots + \bar{\beta}_p X_p, \qquad \text{where } \bar{\beta}_j = \frac{1}{B}\sum_{b=1}^B \hat{\beta}^b_j \text{ for } j = 0, \ldots, p.$$

Note that we choose the name Monte Carlo method for the proposed method because single-valued data sets are randomly generated from the observed intervals. The choice of $B$ depends on the sizes of the hypercubes, since we want the randomly selected points to cover most of the volume of the hypercubes. We use $B = 1000$ for the examples in this paper.
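The three steps above can be sketched compactly; this is an illustrative sketch of the procedure, with array layout and names assumed as in the earlier sketches.

```python
# Sketch of the MCM: B uniform draws from the observed hypercubes,
# a least-squares fit per draw, and coefficient averaging.
import numpy as np

def fit_mcm(X_low, X_up, y_low, y_up, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X_low.shape
    betas = np.empty((B, p + 1))
    for b in range(B):
        Xb = rng.uniform(X_low, X_up)             # Step 1: one point per interval
        yb = rng.uniform(y_low, y_up)
        Xd = np.column_stack([np.ones(n), Xb])    # Steps 2-3: classical LS fit
        betas[b], *_ = np.linalg.lstsq(Xd, yb, rcond=None)
    return betas.mean(axis=0), betas              # averaged fit plus all B draws
```

The full array of $B$ coefficient vectors is returned alongside the average, since the inference procedures described next are built from it.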

In order to predict the lower and upper bounds of $Y$, we use equation (4). The use of this equation implies that the proposed method, as well as CM and SCM, assumes that the bounds of the response and predictors have the same relationship as their centers. This assumption is not necessarily true in general; however, our empirical studies in Sections 4 and 5 show competitive performance of these methods, even in settings where the assumption is clearly violated.

The most important contribution of the proposed method is that one can conduct statistical significance tests on the regression coefficients. A test for overall model significance, i.e., for the hypothesis $H_0: \beta_1 = \cdots = \beta_p = 0$, can be constructed as follows.

1. Calculate the overall model significance $F$ test statistic for each resampled data set. The $b$th $F$-statistic is given by
$$F^b = \frac{SS^b_{reg}/p}{RSS^b/(n - p - 1)},$$
where $RSS^b = \sum_{i=1}^n (Y^b_i - \hat{Y}^b_i)^2$ and $SS^b_{reg} = \sum_{i=1}^n (\hat{Y}^b_i - \bar{Y}^b)^2$. Note that $F^b$ follows an $F$ distribution with $p$ and $(n - p - 1)$ degrees of freedom under $H_0$.

2. Calculate the $Z$-statistic
$$Z_F = \frac{\bar{F} - M_F}{\sqrt{V_F}}, \qquad (8)$$
where
$$\bar{F} = \frac{1}{B}\sum_{b=1}^B F^b, \qquad M_F = \frac{n - p - 1}{n - p - 3}, \qquad V_F = \frac{2(n - p - 1)^2(n - 3)}{p(n - p - 3)^2(n - p - 5)}.$$

3. The $p$-value for the overall model significance test is given as $P(|Z| \ge |z_F|)$, where $Z \sim N(0, 1)$ and $z_F$ is the observed value from the data.

We note that if $U$ follows an $F$ distribution with degrees of freedom $(\nu_1, \nu_2)$, then
$$E(U) = \frac{\nu_2}{\nu_2 - 2}, \qquad V(U) = \frac{2\nu_2^2(\nu_1 + \nu_2 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}.$$
Therefore, $Z_F$ in (8) approximately follows the standard normal distribution.

We can also construct $100(1 - \alpha)\%$ confidence intervals for the coefficients. The standard error of $\bar{\beta}_j$ is calculated as
$$SE(\bar{\beta}_j) = \sqrt{\frac{1}{B - 1}\sum_{b=1}^B \big(\hat{\beta}^b_j - \bar{\beta}_j\big)^2}.$$
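The test statistic (8) and the coefficient standard errors can be computed directly from the $B$ resampled fits. The following sketch uses only NumPy and the standard library (the two-sided normal $p$-value is obtained from the complementary error function, since $2P(Z > |z|) = \mathrm{erfc}(|z|/\sqrt{2})$); the inputs `fstats` and `betas` are assumed to come from the resampling loop.

```python
# Sketch of the MCM inference step: Z_F of equation (8) and the per-coefficient
# standard errors. `fstats` holds the B F-statistics; `betas` is B x (p+1).
import math
import numpy as np

def normal_p(z):
    """Two-sided N(0,1) p-value: 2 P(Z > |z|) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def overall_test(fstats, n, p):
    """Z_F statistic of equation (8) and its p-value."""
    m_f = (n - p - 1) / (n - p - 3)                   # mean of F(p, n-p-1)
    v_f = 2 * (n - p - 1) ** 2 * (n - 3) / (p * (n - p - 3) ** 2 * (n - p - 5))
    z_f = (np.mean(fstats) - m_f) / math.sqrt(v_f)
    return z_f, normal_p(z_f)

def coef_tests(betas):
    """Standard errors and Z_j of equation (10) for each coefficient."""
    beta_bar = betas.mean(axis=0)
    se = betas.std(axis=0, ddof=1)                    # (B-1)-denominator SE
    z = beta_bar / se
    return se, z, np.array([normal_p(v) for v in z])
```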

Then, the $100(1 - \alpha)\%$ confidence interval for the regression coefficient $\beta_j$ is given by
$$\bar{\beta}_j \pm z_{\alpha/2}\, SE(\bar{\beta}_j), \qquad (9)$$
where $z_{\alpha/2}$ denotes the upper $100(\alpha/2)$th percentile of the standard normal distribution. This approach works well if the distribution of the $B$ coefficient estimates is approximately normal. One can examine the normality of the estimates using a histogram and a Q-Q plot.

In addition to providing standard errors and confidence intervals, MCM can conduct statistical hypothesis tests for the individual regression coefficients, that is, of $H_0: \beta_j = 0$ for $j = 1, \ldots, p$. The test statistics can be calculated as
$$Z_j = \frac{\bar{\beta}_j}{SE(\bar{\beta}_j)}, \qquad (10)$$
and $Z_j$ approximately follows $N(0, 1)$ under the null hypothesis. The $p$-value can then be found using the standard normal distribution.

4 A Simulation Study

In this section, we compare the five methods, CM, CRM, BCRM, SCM, and the proposed MCM, via simulated data sets, which are generated in the following way. Let $X_1, X_2, X_3$ be three interval-valued explanatory variables, and $Y$ be the interval-valued response variable.

1. Suppose that $X^c_{ij}$ is the center of the $j$th explanatory variable and $Y^c_i$ is the center of the response variable of the $i$th observation. Assume that $X^c_{i1}, X^c_{i2}, X^c_{i3}$ and $Y^c_i$ have a linear relation as follows:
$$Y^c_i = \beta_0 + \beta_1 X^c_{i1} + \beta_2 X^c_{i2} + \beta_3 X^c_{i3} + \epsilon^c_i. \qquad (11)$$
We consider two sets of regression coefficients, $\boldsymbol{\beta}^T := (\beta_0, \beta_1, \beta_2, \beta_3) = (0, 0.3, 0, 0.5)$ and $\boldsymbol{\beta}^T = (0, 0.3, 0, -0.5)$. The errors follow $\epsilon^c_i \sim N(0, \sigma^2_c)$.

2. For the $i$th observation, randomly generate $X^c_{i1}$ uniformly from $\{11, 12, \ldots, 20\}$, $X^c_{i2}$ from $\{21, 22, \ldots, 30\}$, and $X^c_{i3}$ from $\{31, 32, \ldots, 50\}$, for $i = 1, \ldots, n$. Also generate $Y^c_i$ from the model in (11). Here, we set $n = 30$.

3. Suppose that $X^r_{ij}$ is the range of the $j$th explanatory variable and $Y^r_i$ is the range of the response variable of the $i$th observation. For the $i$th observation, generate $X^r_{i1}$ randomly from $U(1, 2)$, $X^r_{i2}$ from $U(2, 3)$, and $X^r_{i3}$ from $U(3, 4)$, for $i = 1, \ldots, n$. For the $Y^r_i$'s, we consider three different scenarios:

(i) Settings I and II: independently generate them from $U(1, 2)$.

(ii) Settings III and IV: generate them from the following relationship:
$$Y^r_i = \gamma_0 + \gamma_1 X^r_{i1} + \gamma_2 X^r_{i2} + \gamma_3 X^r_{i3} + \epsilon^r_i.$$
Here, $(\gamma_0, \gamma_1, \gamma_2, \gamma_3) = (0, 0.1, 0, 0.2)$ and $\epsilon^r_i \sim N(0, \sigma^2_r)$, where $\sigma_r = 0.5$.

(iii) Settings V and VI: for training data sets, independently generate them from $U(3, 5)$ if the amplitudes of the corresponding $Y^c_i$'s belong to the top 30%; otherwise, from $U(1, 2)$. For testing data sets, we use the top 10% instead of 30% to implement a slight sampling bias. This mixture setting is motivated by the Bats species data set analyzed in Section 5.1.

4. Calculate the bounds of the $i$th observation as $[X^c_{ij} - X^r_{ij},\, X^c_{ij} + X^r_{ij}]$. Similarly, calculate the bounds of $Y_i$ as $[Y^c_i - Y^r_i,\, Y^c_i + Y^r_i]$.

In order to assess the performance of the five methods, we use various criteria from the existing literature. We generate independent testing sets with sample size 30 to calculate the following criteria. The lower bound root mean-square error ($RMSE_L$) and the upper bound root mean-square error ($RMSE_U$) proposed by Lima Neto and de Carvalho (2008) measure the differences between the predicted values $[\hat{Y}_{Li}, \hat{Y}_{Ui}]$ and the observed values $[Y_{Li}, Y_{Ui}]$:
$$RMSE_L = \sqrt{\frac{\sum_{i=1}^n (Y_{Li} - \hat{Y}_{Li})^2}{n}} \qquad \text{and} \qquad RMSE_U = \sqrt{\frac{\sum_{i=1}^n (Y_{Ui} - \hat{Y}_{Ui})^2}{n}}.$$
The symbolic correlation coefficient, $r$, proposed by Billard (2007, 2008) and applied to the regression problem by Xu (2010), measures the correlation between the predicted values $[\hat{Y}_{Li}, \hat{Y}_{Ui}]$ and the observed values $[Y_{Li}, Y_{Ui}]$:
$$r(Y, \hat{Y}) = \frac{\mathrm{Cov}(Y, \hat{Y})}{S_Y S_{\hat{Y}}},$$
where $\mathrm{Cov}(Y, \hat{Y})$ is the symbolic covariance between $Y$ and $\hat{Y}$ in equation (7), and $S_Y$ and $S_{\hat{Y}}$ are the standard deviations of the $Y_i$ and $\hat{Y}_i$, respectively, which can be computed as follows (Bertrand and Goupil, 2000): for an interval-valued random sample $Y_i = [a_i, b_i]$, $i = 1, \ldots, n$, of the variable $Y$, the symbolic sample variance is
$$S_Y^2 = (3n)^{-1} \sum_{i=1}^n (a_i^2 + a_i b_i + b_i^2) - (2n)^{-2} \left[\sum_{i=1}^n (a_i + b_i)\right]^2.$$

We summarize the simulation results in Tables 1-6. In each table, we report the median of the regression coefficients from 100 repetitions, along with standard errors computed using the median absolute deviation. We also report the median of the aforementioned evaluation measures calculated with the testing data at the bottom of each table. Overall, the coefficient estimates from all methods are similar to one another and reasonably close to the true coefficients, with the exception of SCM when $\beta_3$ is negative. In terms of the prediction measures, MCM and CRM are the best in all settings. We note that BCRM shows poor prediction performance with separate testing data sets, possibly due to over-parametrization. The simplest method, CM, performs reasonably well in general; however, since it does not take care of the issue of switching lower and upper bounds, it shows inferior performance when there is a negative coefficient. Since CRM fits both the center and the range models, it is expected to be the best in Settings III and IV, whose results are shown in Tables 3 and 4. In these cases, it is notable that the proposed MCM performs similarly to CRM. In Settings V and VI, where the data contain a bimodal structure and mild sampling bias, MCM works best in almost all measures, as seen in Tables 5 and 6. In particular, compared to CRM, which shows superiority in the other settings, MCM yields higher $r$ and lower RMSE values for both settings. This demonstrates that the proposed method is a robust choice in various circumstances. It may not be the best performer for any single setting, but it outperforms the others in one way or another.
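The symbolic mean, variance, covariance, and correlation used above translate directly into code. The following is a sketch under the assumed bounds-array layout; with degenerate intervals (lower equal to upper) these statistics reduce to the classical population moments, which makes a convenient sanity check.

```python
# Symbolic sample statistics of Bertrand and Goupil (2000) and Billard (2007, 2008).
import numpy as np

def symbolic_mean(low, up):
    """Symbolic sample mean of an interval-valued variable."""
    return (low + up).sum() / (2 * len(low))

def symbolic_var(low, up):
    """Symbolic sample variance S^2."""
    n = len(low)
    return (low**2 + low * up + up**2).sum() / (3 * n) \
        - ((low + up).sum() / (2 * n)) ** 2

def symbolic_cov(lj, uj, lk, uk):
    """Symbolic sample covariance, equation (7)."""
    n = len(lj)
    mj, mk = symbolic_mean(lj, uj), symbolic_mean(lk, uk)
    return (2 * (lj - mj) * (lk - mk) + (lj - mj) * (uk - mk)
            + (uj - mj) * (lk - mk) + 2 * (uj - mj) * (uk - mk)).sum() / (6 * n)

def symbolic_corr(ly, uy, lz, uz):
    """Symbolic correlation r, used as an accuracy measure in Section 4."""
    return symbolic_cov(ly, uy, lz, uz) / np.sqrt(symbolic_var(ly, uy) * symbolic_var(lz, uz))
```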
As explained in Section 3, the most important contribution of the proposed MCM is that it can provide statistical inference for the estimated model in a natural way. Table 7 displays the medians of the overall model significance test statistics, the Z-statistics, and the p-values for the individual significance tests from 100 repetitions, along with standard errors. The MCM provides correct inference results for all cases: it correctly rejects the overall F-test with very small p-values, rejects $H_0: \beta_1 = 0$ and $H_0: \beta_3 = 0$, and does not reject $H_0: \beta_2 = 0$. The p-values for $H_0: \beta_1 = 0$ are higher than those for $H_0: \beta_3 = 0$ since the magnitude of the true $\beta_1$ is closer to zero. Using these statistical inference results, one can identify important variables and further increase prediction accuracy by removing

insignificant variables from the regression model.

We also run another simulation with a null model whose coefficients are all zero under Setting I. Its results are also summarized at the bottom of Table 7. Based on the median p-value of the $Z_F$ test for model significance, we correctly fail to reject the null hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = 0$.

Table 1: Setting I, where $Y^r$ is independent of the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, 0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 2: Setting II, where $Y^r$ is independent of the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, -0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 3: Setting III, where $Y^r$ is linear in the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, 0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 4: Setting IV, where $Y^r$ is linear in the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, -0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 5: Setting V, where $Y^r$ is independent of the $X^r_j$ with the bimodal setting, and $\boldsymbol{\beta} = (0, 0.3, 0, 0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 6: Setting VI, where $Y^r$ is independent of the $X^r_j$ with the bimodal setting, and $\boldsymbol{\beta} = (0, 0.3, 0, -0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 7: Inference results by MCM for Settings I-VI and the null model. Medians of the Z-statistics and p-values are shown with standard errors.

5 Real Examples

In this section, we analyze two real interval-valued data sets to compare the performance of the proposed and existing methods. For each data set, we use the leave-one-out principle to calculate the three measures, $r$, $RMSE_L$, and $RMSE_U$, while using all the observations to obtain the coefficient estimates and the statistical inference from the proposed method.

5.1 Bats Species Data Set

The bats species data set in Xu (2010) is a naturally occurring interval-valued data set that records physical measurements of 21 different species of bats. There are four interval-valued random variables: $X_1$ = head size, $X_2$ = tail length, $X_3$ = forearm length, and $Y$ = weight.

Table 8: The results for the Bats species data.

Table 9: Bats species data: inferences concerning the regression coefficients by MCM.

Table 8 displays the results of the five considered methods. The estimates of the regression coefficients are somewhat similar across all the methods. The over-parametrized BCRM has the lowest RMSEs and the highest $r$, but it does not possess good generalizability when a separate testing data set is given, as shown in Section 4. The MCM, CM, and SCM show similar performance in terms of the three measures, and CRM is the least favorable for this data set. This can be explained by examining the distribution of the ranges of the response variable, which is displayed in Figure 1. It shows a bimodal distribution that substantially affects the range model. This bimodal structure also creates a sampling bias in the leave-one-out procedure, which costs CRM prediction accuracy. The inferior performance of CRM in Settings V and VI in Section 4 confirms this.

Figure 1: Histogram of the range of the response variable for the Bats species data.

Table 9 provides the statistical inference results by MCM. The p-value is 0 for the overall F-test, and thus the null hypothesis, $H_0: \beta_1 = \beta_2 = \beta_3 = 0$, is rejected. From the p-values for the individual regression coefficients, one can conclude that head size has the most significant influence on the weight, and tail length has the least. We investigate the sampling distributions of $\bar{\beta}_j$, $j = 1, 2, 3$, with histograms and Q-Q plots in Figure 2. The estimates are all reasonably normally distributed.

5.2 Blood Pressure Data Set

We illustrate the comparison of the five methods with a second example, the blood pressure data set from Billard and Diday (2000). The blood pressure data set contains 11 observations with three variables: pulse rate, systolic pressure, and diastolic pressure. We exclude the 11th data point from our analysis because it is known that systolic pressure should be higher than diastolic pressure (Xu, 2010). Each value in the data is an interval, since pulse rates and blood pressure values fluctuate considerably. Suppose that $X_1$ = systolic pressure, $X_2$ = diastolic pressure, and $Y$ = pulse rate.

Table 10: The results for the Blood pressure data.

Table 11: Blood pressure data: inferences concerning the regression coefficients by MCM.

The consistency of the estimated coefficients across all the methods found in the previous example is no longer observed for this data set. According to the three measures, SCM shows the best and BCRM the worst performance. However, Table 11 suggests that the model is not significant at $\alpha = .05$, which makes the aforementioned measures uninterpretable. The tests of the individual coefficients, for systolic pressure and diastolic pressure, are consistent with the overall F-test result. The histograms and Q-Q plots of the estimates from 1000 Monte Carlo replications are shown in Figure 3. The systolic pressure and diastolic pressure coefficients are roughly normally distributed, but some outlying cases are identified, suggesting somewhat mild non-normal behavior.

6 Conclusion

In this paper, we propose a new method for fitting a linear regression model to interval-valued data using Monte Carlo simulation. The proposed method randomly generates a large number, say $B$, of single-valued data sets, each of which consists of points randomly chosen within the observed intervals. For each of the $B$ data sets, a classical regression model is fitted by least squares estimation. The final model for the original interval-valued data is then obtained by aggregating the $B$ estimated models. In this way, the proposed method fully utilizes the internal variation in the interval-valued observations. Furthermore, using the sampling distributions obtained from the random sampling process, one can carry out statistical inference on the regression coefficients, such as confidence intervals and hypothesis tests, in a straightforward way.

The backbone idea of this work can be readily extended to other statistical problems such as principal component analysis, classification, and clustering. For example, for binary classification, one can obtain $B$ classification rules $f_1, \ldots, f_B$ based on single-valued data sets generated by the Monte Carlo sampling used in this paper. The class label for a new interval-valued observation can then be predicted as follows: (1) randomly generate a single point from the given interval, apply each $f_j$, $j = 1, \ldots, B$, and record the majority vote among the $B$ predicted labels; (2) repeat the previous step a reasonably large number of times, say $C$. Finally, the predicted label for the new interval is determined by the majority vote among the $C$ predictions.
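The two-stage voting scheme just described can be sketched as follows; representing a fitted rule as a plain callable that returns a label is our own assumption, made to keep the sketch library-free.

```python
# Sketch of the classification extension: apply B trained rules to C random
# points drawn from a new interval observation and take nested majority votes.
import numpy as np

def predict_interval_label(rules, x_low, x_up, C=50, seed=0):
    """Predict the label of a new interval observation by two-stage majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(C):
        x = rng.uniform(x_low, x_up)              # one point from the hypercube
        labels = [f(x) for f in rules]            # apply each of the B rules
        vals, counts = np.unique(labels, return_counts=True)
        votes.append(vals[np.argmax(counts)])     # majority among the B labels
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]                # majority among the C votes
```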
Multi-category classification can be approached in a similar way.

One possible drawback of the proposed approach is that it is computationally intensive: by the nature of Monte Carlo simulation, a larger number of repetitions is always desirable. However, we believe that this disadvantage is outweighed by the method's good properties. First, it relieves the need to develop complex methodologies for interval-valued data. Second, it can be immediately applied to another popular type of symbolic data, namely histogram-valued data. Third, it is flexible in the

sense that one can modify the Monte Carlo resampling scheme to suit specific problems. For example, one may assume a non-uniform distribution within an interval, such as a truncated normal distribution.

References

Alfonso, F., Billard, L., and Diday, E. (2004). Symbolic linear regression with taxonomies. In Banks, D., House, L., McMorris, F., Arabie, P., and Gaul, W., editors, Classification, Clustering and Data Mining Applications. Springer-Verlag, Berlin.

Bertrand, P. and Goupil, F. (2000). Descriptive statistics for symbolic data. In Bock, H.-H. and Diday, E., editors, Analysis of Symbolic Data. Springer-Verlag, Berlin.

Billard, L. (2007). Dependencies and variation components of symbolic interval-valued data. In Brito, P., Cucumel, G., Bertrand, P., and de Carvalho, F., editors, Selected Contributions in Data Analysis and Classification. Springer-Verlag, Berlin.

Billard, L. (2008). Sample covariance functions for complex quantitative data. In World Congress, International Association of Computational Statistics, Yokohama, Japan.

Billard, L. (2011). Brief overview of symbolic data and analytic issues. Statistical Analysis and Data Mining, 4.

Billard, L. and Diday, E. (2000). Regression analysis for interval-valued data. In Kiers, H. A. L., Rasson, J.-P., Groenen, P. J. F., and Schader, M., editors, Data Analysis, Classification, and Related Methods. Springer-Verlag, Berlin.

Billard, L. and Diday, E. (2002). Symbolic regression analysis. In Classification, Clustering and Data Analysis: Proceedings of the 8th Conference of the International Federation of Classification Societies (IFCS 02). Springer, Poland.

Billard, L. and Diday, E. (2003). From the statistics of data to the statistics of knowledge: Symbolic data analysis. Journal of the American Statistical Association, 98.

Billard, L. and Diday, E. (2007). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.

Diday, E. (1995). Probabilist, possibilist and belief object for knowledge analysis. Annals of Operations Research, 55.

Diday, E. (2003). An introduction to symbolic data analysis and the SODAS software. Journal of Symbolic Data Analysis, 7.

Diday, E. and Emilion, R. (1996). Lattices and capacities in analysis of probabilist object. In Diday, E., Lechevallier, Y., and Opitz, O., editors, Studies in Classification.

Diday, E. and Emilion, R. (1998). Capacities and credibilities in analysis of probabilistic objects by histograms and lattices. In Hayashi, C., Ohsumi, N., Yajima, K., Tanaka, Y., Bock, H. H., and Baba, Y., editors, Data Science, Classification, and Related Methods.

Good, P. (2006). Resampling Methods. Birkhauser.

Lima Neto, E., Cordeiro, G., and de Carvalho, F. (2011). Bivariate symbolic regression models for interval-valued variables. Journal of Statistical Computation and Simulation, 81.

Lima Neto, E. and de Carvalho, F. (2008). Center and range method for fitting a linear regression model to symbolic interval data. Computational Statistics and Data Analysis, 52.

Lima Neto, E. and de Carvalho, F. (2010). Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis, 54.

Lima Neto, E. A., Cordeiro, G. M., Carvalho, F. A. T., Anjos, U., and Costa, A. (2009). Bivariate generalized linear model for interval-valued variables. In Proceedings 2009 IEEE International Joint Conference on Neural Networks, volume 1, Atlanta, USA.

Lima Neto, E. A., de Carvalho, F. A. T., and Tenorio, C. P. (2004). Univariate and multivariate linear regression methods to predict interval-valued features. In Lecture Notes in Computer Science, AI 2004: Advances in Artificial Intelligence. Springer-Verlag, Berlin.

Maia, A. and Carvalho, F. D. (2008). Fitting a least absolute deviation regression model on symbolic interval data. In Lecture Notes in Artificial Intelligence: Proceedings of the Ninth Brazilian Symposium on Artificial Intelligence. Springer-Verlag, Berlin.

Noirhomme-Fraiture, M. and Brito, P. (2011). Far beyond the classical data models: symbolic data analysis. Statistical Analysis and Data Mining, 4.

Silva, A., Lima Neto, E. A., and Anjos, U. (2011). A regression model to interval-valued variables based on copula approach. In Proceedings of the 58th World Statistics Congress of the International Statistical Institute, Dublin, Ireland.

Xu, W. (2010). Symbolic Data Analysis: Interval-Valued Data Regression. PhD thesis, University of Georgia.

[Figure 2: Bats species data. Histograms and normal Q-Q plots of the Head, Tail, and Forearm coefficient estimates from 1000 resampling data sets.]

[Figure 3: Blood pressure data. Histograms and normal Q-Q plots of the systolic pressure and diastolic pressure coefficient estimates from 1000 resampling data sets.]


More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance CESIS Electronic Working Paper Series Paper No. 223 A Bootstrap Test for Causality with Endogenous Lag Length Choice - theory and application in finance R. Scott Hacker and Abdulnasser Hatemi-J April 200

More information

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators Thilo Klein University of Cambridge Judge Business School Session 4: Linear regression,

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

STAT440/840: Statistical Computing

STAT440/840: Statistical Computing First Prev Next Last STAT440/840: Statistical Computing Paul Marriott pmarriott@math.uwaterloo.ca MC 6096 February 2, 2005 Page 1 of 41 First Prev Next Last Page 2 of 41 Chapter 3: Data resampling: the

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

1 Mixed effect models and longitudinal data analysis

1 Mixed effect models and longitudinal data analysis 1 Mixed effect models and longitudinal data analysis Mixed effects models provide a flexible approach to any situation where data have a grouping structure which introduces some kind of correlation between

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics Amang S. Sukasih, Mathematica Policy Research, Inc. Donsig Jang, Mathematica Policy Research, Inc. Amang S. Sukasih,

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

y n 1 ( x i x )( y y i n 1 i y 2

y n 1 ( x i x )( y y i n 1 i y 2 STP3 Brief Class Notes Instructor: Ela Jackiewicz Chapter Regression and Correlation In this chapter we will explore the relationship between two quantitative variables, X an Y. We will consider n ordered

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES REVSTAT Statistical Journal Volume 13, Number 3, November 2015, 233 243 MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES Authors: Serpil Aktas Department of

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Discussiones Mathematicae Probability and Statistics 36 206 43 5 doi:0.75/dmps.80 A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Tadeusz Bednarski Wroclaw University e-mail: t.bednarski@prawo.uni.wroc.pl

More information