A Resampling Approach for Interval-Valued Data Regression


Jeongyoun Ahn, Muliang Peng, Cheolwoo Park
Department of Statistics, University of Georgia, Athens, GA 30602, USA

Yongho Jeon
Department of Applied Statistics, Yonsei University, Seoul, Korea

March 4, 2012

Abstract

We consider interval-valued data, which frequently appear with advanced technologies in current data collection processes. Interval-valued data refer to data that are observed as ranges instead of single values. In the last decade, several approaches to the regression analysis of interval-valued data have been introduced, but little work has been done on relevant statistical inferences concerning the regression model. In this paper, we propose a new approach to fitting a linear regression model to interval-valued data using a resampling idea. A key advantage is that it enables one to make inferences on the model, such as the overall model significance test and individual coefficient tests. We demonstrate the proposed approach using simulated and real data examples, and also compare its performance with those of existing methods.

Key words: Interval-valued data; Linear regression; Resampling; Statistical inference.

1 Introduction

Even with ever-advancing computing power, analyzing data sets of massive size is often not practical. As a consequence, aggregating information from such huge databases is routinely done. For example, a person's weekly expenditure during a certain time period can be aggregated to an interval with bounds of the lowest and highest weekly spending. More elaborate aggregation can be

made by constructing a histogram of a person's weekly expenditure. Another source of interval-valued data can be the innate nature of a variable; for example, a person's blood pressure is often measured as an interval of systolic and diastolic pressures. Interval-valued data belong to a more general class of data types called symbolic data (Diday, 2003; Billard and Diday, 2003, 2007). Typical examples of symbolic data include interval-, histogram-, and modal-valued data. These types of data are difficult to analyze with classical methods. A main challenge is how to account for internal variation or certain structure within an observation. There have been substantial efforts to extend classical methods to symbolic data for various analysis problems (Diday, 1995; Diday and Emilion, 1996, 1998). For more recent developments for different types of symbolic data analysis, see Billard (2011) and Noirhomme-Fraiture and Brito (2011).

This paper focuses on regression analysis with interval-valued data, the most frequently observed symbolic data type. There have been several linear regression approaches for interval-valued data. The first was proposed by Billard and Diday (2000). Their main idea is to build a classical linear regression model using the center points of the observed intervals and to use the fitted model for predicting the lower and upper bounds of the interval of the response variable. Billard and Diday (2002) instead fit individual models for the lower and upper bounds of the intervals and estimate the bounds of the response variable separately. However, neither of these approaches takes the internal variation of the observed intervals into account in the estimation of the model. In an attempt to resolve this problem, Lima Neto et al. (2004) propose the center and range method, which fits two different linear regression models for the center points and the ranges of the intervals, respectively.
They also make separate predictions for the center and the range of the response variable. Billard and Diday (2007) also attempt to account for the width of intervals by developing the bivariate center and range method. Their method constructs two regression models using both the center points and the ranges of intervals as the predictors. However, these approaches still have some limitations because they approximate the internal variation with the range only. Another issue sometimes encountered with these approaches is that the predicted lower bound can be bigger than the predicted upper bound when a slope estimate is negative. Lima Neto and de Carvalho (2010) suggest using constraints that guarantee coherence between the predicted values of the lower and upper bounds. Recently, Xu (2010) proposed a symbolic covariance method to address these two issues based on the symbolic sample

covariance suggested by Billard (2007, 2008). Other notable works on interval-valued regression analysis include Alfonso et al. (2004), who include taxonomy variables as predictors to account for hierarchical structures among symbolic variables, and Maia and Carvalho (2008), who develop a robust method via a least absolute deviation regression model.

These existing methods are useful for model estimation purposes, but only a few of them attempt to make a statistical inference on the estimated model. For example, Lima Neto et al. (2009) and Lima Neto et al. (2011) construct a bivariate generalized linear model based on a bivariate exponential family of distributions. Silva et al. (2011) propose a regression model using copulas, which allows statistical inference under the framework of joint distributions for bivariate random vectors.

In this work we propose a new regression approach for interval-valued data based on resampling. We propose to (i) generate a large number of samples by randomly selecting a single-valued point within each observed interval; (ii) fit a classical linear regression model to each single-valued sample; and (iii) calculate the mean regression coefficients over the fitted models, and then use them to produce a final linear regression model. The main contributions of the proposed method are two-fold: (i) it fully makes use of the variability of the interval-valued data; and (ii) more importantly, it enables one to make inferences regarding the regression coefficients because the sampling distributions of the estimated coefficients are obtained via Monte Carlo sampling. Even though we use linear regression to demonstrate the methodology, the proposed approach can be expanded further to build a unified framework for numerical-valued symbolic data such as interval-valued and histogram-valued data.
Also, the core idea of this work can be straightforwardly applied to other statistical problems such as discriminant analysis. Section 2 reviews some of the existing linear regression methods for interval-valued data. We describe the proposed method in Section 3. The proposed method is applied to both simulated and real examples in Sections 4 and 5, respectively, and is compared with the methods described in Section 2. The paper ends with conclusions in Section 6.

2 Regression with Interval-valued Data

Let $X_1, \ldots, X_p$ be $p$ explanatory variables and $Y$ be the response variable, all of whose realizations are intervals: $X_{ij} = [X_{Lij}, X_{Uij}]$ $(X_{Lij} \le X_{Uij})$ and $Y_i = [Y_{Li}, Y_{Ui}]$ $(Y_{Li} \le Y_{Ui})$, for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. We consider the following linear regression model:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad (1)$$
where $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$, $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_n)^T$, $\mathbf{X}_i = (1, X_{i1}, \ldots, X_{ip})^T$ for $i = 1, \ldots, n$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$, $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^T$, and each $\epsilon_i$ independently follows $N(0, \sigma^2)$.

Several linear regression approaches for interval-valued data have been proposed over recent years. Among them we discuss four existing methods in this paper for comparison purposes. Xu (2010) thoroughly illustrates the four methods, and we summarize them here using similar notation.

The first work on this problem is the center method (CM) proposed by Billard and Diday (2000). The CM fits a linear regression model on the midpoints of the interval values. Let $X^c_1, \ldots, X^c_p$ be the center points of the intervals of the variables $X_1, \ldots, X_p$, and $Y^c$ be the center point of $Y$. Then, CM converts the model (1) into a classical linear regression model,
$$\mathbf{Y}^c = \mathbf{X}^c\boldsymbol{\beta}^c + \boldsymbol{\epsilon}^c, \qquad (2)$$
where $\mathbf{Y}^c = (Y^c_1, \ldots, Y^c_n)^T$, $\mathbf{X}^c = (\mathbf{X}^c_1, \ldots, \mathbf{X}^c_n)^T$, $\mathbf{X}^c_i = (1, X^c_{i1}, \ldots, X^c_{ip})^T$ for $i = 1, \ldots, n$, $\boldsymbol{\beta}^c = (\beta^c_0, \beta^c_1, \ldots, \beta^c_p)^T$, and $\boldsymbol{\epsilon}^c = (\epsilon^c_1, \ldots, \epsilon^c_n)^T$. Then, $\boldsymbol{\beta}^c$ can be estimated by the usual least squares method. Suppose a new observation $(\mathbf{x}^0_L, \mathbf{x}^0_U)$ is obtained, where $\mathbf{x}^0_L = (1, x^0_{L1}, \ldots, x^0_{Lp})$ and $\mathbf{x}^0_U = (1, x^0_{U1}, \ldots, x^0_{Up})$. The prediction of the lower and upper bounds of $Y = [Y_L, Y_U]$ originally given by Billard and Diday (2000) is
$$\hat{Y}_L = \mathbf{x}^0_L \hat{\boldsymbol{\beta}}^c, \qquad \hat{Y}_U = \mathbf{x}^0_U \hat{\boldsymbol{\beta}}^c. \qquad (3)$$
Xu (2010) notices that the use of (3) could yield a lower bound higher than an upper bound. Hence he suggests the following modified prediction,
$$\hat{Y}_L = \min(\mathbf{x}^0_L \hat{\boldsymbol{\beta}}^c,\, \mathbf{x}^0_U \hat{\boldsymbol{\beta}}^c), \qquad \hat{Y}_U = \max(\mathbf{x}^0_L \hat{\boldsymbol{\beta}}^c,\, \mathbf{x}^0_U \hat{\boldsymbol{\beta}}^c), \qquad (4)$$
which will also be used for our proposed method.
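As a concrete illustration, the CM fit and the order-preserving prediction rule (4) can be sketched as follows. This is a minimal sketch, not the authors' code; the array layout (separate lower- and upper-bound matrices) and the function names are our own assumptions.

```python
# Minimal sketch of the center method (CM) with Xu's (2010) prediction rule (4).
import numpy as np

def fit_cm(X_low, X_up, y_low, y_up):
    """Least-squares fit on the interval midpoints, as in Billard and Diday (2000)."""
    Xc = (X_low + X_up) / 2.0                     # n x p matrix of center points
    yc = (y_low + y_up) / 2.0                     # response centers
    Xd = np.column_stack([np.ones(len(yc)), Xc])  # prepend intercept column
    beta_c, *_ = np.linalg.lstsq(Xd, yc, rcond=None)
    return beta_c

def predict_cm(beta_c, x0_low, x0_up):
    """Predict [Y_L, Y_U]; the min/max guards against bound switching, as in (4)."""
    lo = np.concatenate(([1.0], x0_low)) @ beta_c
    up = np.concatenate(([1.0], x0_up)) @ beta_c
    return min(lo, up), max(lo, up)
```

When a slope estimate is negative, the raw bound predictions can cross; the `min`/`max` in `predict_cm` is exactly the modification in (4).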

The CM is simple to implement and has some good empirical properties, as shown in Section 4, but as Lima Neto and de Carvalho (2008) point out, it does not utilize the information from the ranges of the intervals in the estimation step. In other words, it does not take the internal variation of the intervals into account. Furthermore, it is impossible to make statistical inferences on the regression coefficients because the sampling variances cannot be estimated from the procedure.

In order to resolve this problem, Lima Neto et al. (2004) propose the center and range method (CRM), which fits two separate linear regression models with the center points and the ranges of the intervals, respectively. That is, they consider the ranges of the intervals in the estimation and prediction, as well as the centers. For the center model, they use the same model given in (2). For the range model, let $X^r_1, \ldots, X^r_p$ be the $p$ ranges of the intervals of $X_1, \ldots, X_p$, and $Y^r$ be the range of the interval of $Y$. Also, let the observed values of $X^r_j$ be $X^r_{ij} = (X_{Uij} - X_{Lij})$ and the observed values of $Y^r$ be $Y^r_i = (Y_{Ui} - Y_{Li})$, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Then, the linear regression model on the ranges is given by
$$\mathbf{Y}^r = \mathbf{X}^r\boldsymbol{\beta}^r + \boldsymbol{\epsilon}^r, \qquad (5)$$
where $\mathbf{Y}^r = (Y^r_1, \ldots, Y^r_n)^T$, $\mathbf{X}^r = (\mathbf{X}^r_1, \ldots, \mathbf{X}^r_n)^T$, $\mathbf{X}^r_i = (1, X^r_{i1}, \ldots, X^r_{ip})^T$ for $i = 1, \ldots, n$, $\boldsymbol{\beta}^r = (\beta^r_0, \beta^r_1, \ldots, \beta^r_p)^T$, and $\boldsymbol{\epsilon}^r = (\epsilon^r_1, \ldots, \epsilon^r_n)^T$. Again, $\boldsymbol{\beta}^r$ can be estimated by the usual least squares estimation. Then, separate predictions are made for the center and the range of the response variable. Finally, the predicted interval $\hat{Y} = [\hat{Y}_L, \hat{Y}_U]$ is given by
$$\hat{Y}_L = \hat{Y}^c - \hat{Y}^r/2, \qquad \hat{Y}_U = \hat{Y}^c + \hat{Y}^r/2,$$
where $\hat{Y}^c$ and $\hat{Y}^r$ are the predicted values under the models (2) and (5), respectively. The CRM relies on the assumption of independence between midpoints and ranges, which may not hold in general.
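The two CRM fits and their recombination into an interval prediction can be sketched as follows, again a minimal sketch under the same assumed array layout as before.

```python
# Minimal sketch of the CRM: one least-squares fit for centers, one for ranges,
# then recombination of the two predictions into an interval.
import numpy as np

def fit_crm(X_low, X_up, y_low, y_up):
    n = len(y_low)
    Xc = np.column_stack([np.ones(n), (X_low + X_up) / 2.0])  # center model (2)
    Xr = np.column_stack([np.ones(n), X_up - X_low])          # range model (5)
    beta_c, *_ = np.linalg.lstsq(Xc, (y_low + y_up) / 2.0, rcond=None)
    beta_r, *_ = np.linalg.lstsq(Xr, y_up - y_low, rcond=None)
    return beta_c, beta_r

def predict_crm(beta_c, beta_r, x0_low, x0_up):
    yc_hat = np.concatenate(([1.0], (x0_low + x0_up) / 2.0)) @ beta_c
    yr_hat = np.concatenate(([1.0], x0_up - x0_low)) @ beta_r
    return yc_hat - yr_hat / 2.0, yc_hat + yr_hat / 2.0       # [Y_L, Y_U]
```

Note that nothing in the fit ties the two models together, which is the independence assumption between centers and ranges discussed above.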
Also, even though the variation in the intervals is explained through the ranges, it is not clear how it is transferred to the variation of the estimated regression coefficients.

Billard and Diday (2007) also attempt to account for the width of intervals by developing the bivariate center and range method (BCRM). As in CRM, their method constructs two separate regression models for predicting both the center points and the ranges of intervals. However, BCRM utilizes both the centers and the ranges as predictors in the two models. These two models can be expressed

as follows:
$$\mathbf{Y}^c = \mathbf{X}^{cr}\boldsymbol{\beta}^c + \boldsymbol{\epsilon}^c, \qquad \mathbf{Y}^r = \mathbf{X}^{cr}\boldsymbol{\beta}^r + \boldsymbol{\epsilon}^r,$$
where $\mathbf{X}^{cr} = (\mathbf{X}^{cr}_1, \ldots, \mathbf{X}^{cr}_n)^T$ and $\mathbf{X}^{cr}_i = (1, X^c_{i1}, \ldots, X^c_{ip}, X^r_{i1}, \ldots, X^r_{ip})^T$ for $i = 1, \ldots, n$. Note that the number of columns in the design matrix $\mathbf{X}^{cr}$ is $2p + 1$ for this model, while it is $p + 1$ for CRM. The estimation and prediction can be done similarly as in CRM. Since the BCRM has more predictors than CM and CRM, it is expected to have higher explanatory power for the current data set. However, as we will see in Section 4, its prediction accuracy for future data is not necessarily better than that of the other methods. Again, statistical inference for the estimated models is not considered.

The three approaches presented so far apply classical regression modeling by converting an interval-valued data set into a classical one. Xu (2010) proposes the symbolic covariance method (SCM), which uses the symbolic covariance matrix proposed by Billard (2007, 2008). He considers the following model with centered variables, for which the least squares estimator is given by
$$\mathbf{Y} - \bar{Y} = (\mathbf{X} - \bar{\mathbf{X}})\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \hat{\boldsymbol{\beta}} = \{(\mathbf{X} - \bar{\mathbf{X}})^T(\mathbf{X} - \bar{\mathbf{X}})\}^{-1}(\mathbf{X} - \bar{\mathbf{X}})^T(\mathbf{Y} - \bar{Y}) = S_{XX}^{-1} S_{XY}, \qquad (6)$$
where $S_{XX}$ is the symbolic sample variance-covariance matrix of the predictors and $S_{XY}$ is the vector of the symbolic sample covariances between $Y$ and the predictors. The symbolic sample covariance between interval-valued variables $X_j$ and $X_k$ is defined as follows (Billard, 2007, 2008):
$$\mathrm{Cov}(X_j, X_k) = (6n)^{-1} \sum_{i=1}^n \big[2(X_{Lij} - \bar{X}_j)(X_{Lik} - \bar{X}_k) + (X_{Lij} - \bar{X}_j)(X_{Uik} - \bar{X}_k) + (X_{Uij} - \bar{X}_j)(X_{Lik} - \bar{X}_k) + 2(X_{Uij} - \bar{X}_j)(X_{Uik} - \bar{X}_k)\big], \qquad (7)$$
where the symbolic sample mean of $X_j$ is defined as (Bertrand and Goupil, 2000)
$$\bar{X}_j = (2n)^{-1} \sum_{i=1}^n (X_{Lij} + X_{Uij}).$$

3 Proposed Approach

In this section, we propose a new method, called the Monte Carlo method (MCM), that utilizes a resampling idea (Good, 2006) to fit a linear regression model on interval-valued data. It not

only accounts for the internal variation of the intervals but also derives approximate sampling distributions of the estimated regression coefficients via Monte Carlo resampling.

Note that an interval-valued data set with $p$ predictors and one response variable can be represented by $n$ hypercubes in $(p+1)$-dimensional space. The MCM procedure is implemented as follows. For $b = 1, \ldots, B$:

1. Generate a vector-valued multivariate random sample $S_b$ of size $n$ in $\mathbb{R}^{p+1}$, in which each vector is randomly generated from a multivariate uniform distribution on each of the $n$ hypercubes. In other words, using the uniform distribution, randomly generate a single-valued data point $X^b_{ij}$ from the interval of the $j$th predictor $X_{ij} = [a_{ij}, b_{ij}]$ and $Y^b_i$ from the interval of the response variable $Y_i = [c_i, d_i]$, respectively, for $i = 1, \ldots, n$ and $j = 1, \ldots, p$, and obtain a random vector $(Y^b_i, X^b_{i1}, \ldots, X^b_{ip})^T$.

2. Construct the $b$th random single-valued regression sample, $\mathbf{Y}^b = (Y^b_1, \ldots, Y^b_n)^T$ and $\mathbf{X}^b_j = (X^b_{1j}, \ldots, X^b_{nj})^T$ for $j = 1, \ldots, p$.

3. Obtain the $b$th regression coefficient estimates with the regression sample in Step 2 via least squares estimation:
$$\hat{\boldsymbol{\beta}}^b = \{(\mathbf{X}^b)^T\mathbf{X}^b\}^{-1}(\mathbf{X}^b)^T\mathbf{Y}^b,$$
where $\mathbf{X}^b$ is the design matrix with rows $(1, X^b_{i1}, \ldots, X^b_{ip})$ and $\hat{\boldsymbol{\beta}}^b = (\hat{\beta}^b_0, \hat{\beta}^b_1, \ldots, \hat{\beta}^b_p)^T$.

The final estimated model for the interval-valued data is obtained by averaging the $B$ estimates,
$$\hat{Y} = \bar{\beta}_0 + \bar{\beta}_1 X_1 + \cdots + \bar{\beta}_p X_p, \qquad \text{where } \bar{\beta}_j = \frac{1}{B}\sum_{b=1}^B \hat{\beta}^b_j \text{ for } j = 0, \ldots, p.$$

Note that we choose the name Monte Carlo method for the proposed method because single-valued data sets are randomly generated from the observed intervals. The choice of $B$ depends on the sizes of the hypercubes, since we want the randomly selected points to cover most of the volume of the hypercubes. We use $B = 1000$ for the examples in this paper.
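The three steps above can be sketched compactly; this is an illustrative sketch of the procedure, with array layout and names assumed as in the earlier sketches.

```python
# Sketch of the MCM: B uniform draws from the observed hypercubes,
# a least-squares fit per draw, and coefficient averaging.
import numpy as np

def fit_mcm(X_low, X_up, y_low, y_up, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X_low.shape
    betas = np.empty((B, p + 1))
    for b in range(B):
        Xb = rng.uniform(X_low, X_up)             # Step 1: one point per interval
        yb = rng.uniform(y_low, y_up)
        Xd = np.column_stack([np.ones(n), Xb])    # Steps 2-3: classical LS fit
        betas[b], *_ = np.linalg.lstsq(Xd, yb, rcond=None)
    return betas.mean(axis=0), betas              # averaged fit plus all B draws
```

The full array of $B$ coefficient vectors is returned alongside the average, since the inference procedures described next are built from it.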

In order to predict the lower and upper bounds of $Y$, we use equation (4). The use of this equation implies that the proposed method, as well as CM and SCM, assumes that the bounds of the response and predictors have the same relationship as their centers. This assumption is not necessarily true in general; however, our empirical studies in Sections 4 and 5 show competitive performance of these methods, even in settings where the assumption is clearly violated.

The most important contribution of the proposed method is that one can conduct statistical significance tests on the regression coefficients. A test for overall model significance, i.e., for the hypothesis $H_0: \beta_1 = \cdots = \beta_p = 0$, can be constructed as follows.

1. Calculate the overall model significance $F$ test statistic for each resampled data set. The $b$th $F$-statistic is given by
$$F^b = \frac{SS^b_{reg}/p}{RSS^b/(n - p - 1)},$$
where $RSS^b = \sum_{i=1}^n (Y^b_i - \hat{Y}^b_i)^2$ and $SS^b_{reg} = \sum_{i=1}^n (\hat{Y}^b_i - \bar{Y}^b)^2$. Note that $F^b$ follows an $F$ distribution with $p$ and $(n - p - 1)$ degrees of freedom under $H_0$.

2. Calculate the $Z$-statistic
$$Z_F = \frac{\bar{F} - M_F}{\sqrt{V_F}}, \qquad (8)$$
where
$$\bar{F} = \frac{1}{B}\sum_{b=1}^B F^b, \qquad M_F = \frac{n - p - 1}{n - p - 3}, \qquad V_F = \frac{2(n - p - 1)^2(n - 3)}{p(n - p - 3)^2(n - p - 5)}.$$

3. The $p$-value for the overall model significance test is given as $P(|Z| \ge |z_F|)$, where $Z \sim N(0, 1)$ and $z_F$ is the observed value from the data.

We note that if $U$ follows an $F$ distribution with degrees of freedom $(\nu_1, \nu_2)$, then
$$E(U) = \frac{\nu_2}{\nu_2 - 2}, \qquad V(U) = \frac{2\nu_2^2(\nu_1 + \nu_2 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}.$$
Therefore, $Z_F$ in (8) approximately follows the standard normal distribution.

We can also construct $100(1 - \alpha)\%$ confidence intervals for the coefficients. The standard error of $\bar{\beta}_j$ is calculated as
$$SE(\bar{\beta}_j) = \sqrt{\frac{1}{B - 1}\sum_{b=1}^B \big(\hat{\beta}^b_j - \bar{\beta}_j\big)^2}.$$
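The test statistic (8) and the coefficient standard errors can be computed directly from the $B$ resampled fits. The following sketch uses only NumPy and the standard library (the two-sided normal $p$-value is obtained from the complementary error function, since $2P(Z > |z|) = \mathrm{erfc}(|z|/\sqrt{2})$); the inputs `fstats` and `betas` are assumed to come from the resampling loop.

```python
# Sketch of the MCM inference step: Z_F of equation (8) and the per-coefficient
# standard errors. `fstats` holds the B F-statistics; `betas` is B x (p+1).
import math
import numpy as np

def normal_p(z):
    """Two-sided N(0,1) p-value: 2 P(Z > |z|) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def overall_test(fstats, n, p):
    """Z_F statistic of equation (8) and its p-value."""
    m_f = (n - p - 1) / (n - p - 3)                   # mean of F(p, n-p-1)
    v_f = 2 * (n - p - 1) ** 2 * (n - 3) / (p * (n - p - 3) ** 2 * (n - p - 5))
    z_f = (np.mean(fstats) - m_f) / math.sqrt(v_f)
    return z_f, normal_p(z_f)

def coef_tests(betas):
    """Standard errors and Z_j of equation (10) for each coefficient."""
    beta_bar = betas.mean(axis=0)
    se = betas.std(axis=0, ddof=1)                    # (B-1)-denominator SE
    z = beta_bar / se
    return se, z, np.array([normal_p(v) for v in z])
```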

Then, the $100(1 - \alpha)\%$ confidence interval for the regression coefficient $\beta_j$ is given by
$$\bar{\beta}_j \pm z_{\alpha/2}\, SE(\bar{\beta}_j), \qquad (9)$$
where $z_{\alpha/2}$ denotes the upper $100(\alpha/2)$th percentile of the standard normal distribution. This approach works well if the distribution of the $B$ coefficient estimates is approximately normal. One can examine the normality of the estimates using a histogram and a Q-Q plot.

In addition to providing standard errors and confidence intervals, MCM can conduct statistical hypothesis tests for the individual regression coefficients, that is, of $H_0: \beta_j = 0$ for $j = 1, \ldots, p$. The test statistics can be calculated as
$$Z_j = \frac{\bar{\beta}_j}{SE(\bar{\beta}_j)}, \qquad (10)$$
and $Z_j$ approximately follows $N(0, 1)$ under the null hypothesis. The $p$-value can then be found using the standard normal distribution.

4 A Simulation Study

In this section, we compare the five methods, CM, CRM, BCRM, SCM, and the proposed MCM, via simulated data sets, which are generated in the following way. Let $X_1, X_2, X_3$ be three interval-valued explanatory variables, and $Y$ be the interval-valued response variable.

1. Suppose that $X^c_{ij}$ is the center of the $j$th explanatory variable and $Y^c_i$ is the center of the response variable of the $i$th observation. Assume that $X^c_{i1}, X^c_{i2}, X^c_{i3}$ and $Y^c_i$ have a linear relation as follows:
$$Y^c_i = \beta_0 + \beta_1 X^c_{i1} + \beta_2 X^c_{i2} + \beta_3 X^c_{i3} + \epsilon^c_i. \qquad (11)$$
We consider two sets of regression coefficients, $\boldsymbol{\beta}^T := (\beta_0, \beta_1, \beta_2, \beta_3) = (0, 0.3, 0, 0.5)$ and $\boldsymbol{\beta}^T = (0, 0.3, 0, -0.5)$. The errors follow $\epsilon^c_i \sim N(0, \sigma^2_c)$.

2. For the $i$th observation, randomly generate $X^c_{i1}$ uniformly from $\{11, 12, \ldots, 20\}$, $X^c_{i2}$ from $\{21, 22, \ldots, 30\}$, and $X^c_{i3}$ from $\{31, 32, \ldots, 50\}$, for $i = 1, \ldots, n$. Also generate $Y^c_i$ from the model in (11). Here, we set $n = 30$.

3. Suppose that $X^r_{ij}$ is the range of the $j$th explanatory variable and $Y^r_i$ is the range of the response variable of the $i$th observation. For the $i$th observation, generate $X^r_{i1}$ randomly from $U(1, 2)$, $X^r_{i2}$ from $U(2, 3)$, and $X^r_{i3}$ from $U(3, 4)$, for $i = 1, \ldots, n$. For the $Y^r_i$'s, we consider three different scenarios:

(i) Settings I and II: independently generate them from $U(1, 2)$.

(ii) Settings III and IV: generate them from the following relationship:
$$Y^r_i = \gamma_0 + \gamma_1 X^r_{i1} + \gamma_2 X^r_{i2} + \gamma_3 X^r_{i3} + \epsilon^r_i.$$
Here, $(\gamma_0, \gamma_1, \gamma_2, \gamma_3) = (0, 0.1, 0, 0.2)$ and $\epsilon^r_i \sim N(0, \sigma^2_r)$, where $\sigma_r = 0.5$.

(iii) Settings V and VI: for training data sets, independently generate them from $U(3, 5)$ if the amplitudes of the corresponding $Y^c_i$'s belong to the top 30%; otherwise, from $U(1, 2)$. For testing data sets, we use the top 10% instead of 30% to implement a slight sampling bias. This mixture setting is motivated by the Bats species data set analyzed in Section 5.1.

4. Calculate the bounds of the $i$th observation as $[X^c_{ij} - X^r_{ij},\, X^c_{ij} + X^r_{ij}]$. Similarly, calculate the bounds of $Y_i$ as $[Y^c_i - Y^r_i,\, Y^c_i + Y^r_i]$.

In order to assess the performance of the five methods, we use various criteria from the existing literature. We generate independent testing sets with sample size 30 to calculate the following criteria. The lower bound root mean-square error ($RMSE_L$) and the upper bound root mean-square error ($RMSE_U$) proposed by Lima Neto and de Carvalho (2008) measure the differences between the predicted values $[\hat{Y}_{Li}, \hat{Y}_{Ui}]$ and the observed values $[Y_{Li}, Y_{Ui}]$:
$$RMSE_L = \sqrt{\frac{\sum_{i=1}^n (Y_{Li} - \hat{Y}_{Li})^2}{n}} \qquad \text{and} \qquad RMSE_U = \sqrt{\frac{\sum_{i=1}^n (Y_{Ui} - \hat{Y}_{Ui})^2}{n}}.$$
The symbolic correlation coefficient, $r$, proposed by Billard (2007, 2008) and applied to the regression problem by Xu (2010), measures the correlation between the predicted values $[\hat{Y}_{Li}, \hat{Y}_{Ui}]$ and the observed values $[Y_{Li}, Y_{Ui}]$:
$$r(Y, \hat{Y}) = \frac{\mathrm{Cov}(Y, \hat{Y})}{S_Y S_{\hat{Y}}},$$
where $\mathrm{Cov}(Y, \hat{Y})$ is the symbolic covariance between $Y$ and $\hat{Y}$ in equation (7), and $S_Y$ and $S_{\hat{Y}}$ are the standard deviations of the $Y_i$ and $\hat{Y}_i$, respectively, which can be computed as follows (Bertrand and Goupil, 2000): for an interval-valued random sample $Y_i = [a_i, b_i]$, $i = 1, \ldots, n$, of the variable $Y$, the symbolic sample variance is
$$S_Y^2 = (3n)^{-1} \sum_{i=1}^n (a_i^2 + a_i b_i + b_i^2) - (2n)^{-2} \left[\sum_{i=1}^n (a_i + b_i)\right]^2.$$

We summarize the simulation results in Tables 1-6. In each table, we report the median of the regression coefficients from 100 repetitions, along with standard errors computed using the median absolute deviation. We also report the median of the aforementioned evaluation measures calculated with the testing data at the bottom of each table. Overall, the coefficient estimates from all methods are similar to one another and reasonably close to the true coefficients, with the exception of SCM when $\beta_3$ is negative. In terms of the prediction measures, MCM and CRM are the best in all settings. We note that BCRM shows poor prediction performance with separate testing data sets, possibly due to over-parametrization. The simplest method, CM, performs reasonably well in general; however, since it does not take care of the issue of switching lower and upper bounds, it shows inferior performance when there is a negative coefficient. Since CRM fits both the center and the range models, it is expected to be the best in Settings III and IV, whose results are shown in Tables 3 and 4. In these cases, it is notable that the proposed MCM performs similarly to CRM. In Settings V and VI, where the data contain a bimodal structure and mild sampling bias, MCM works best in almost all measures, as seen in Tables 5 and 6. In particular, compared to CRM, which shows superiority in the other settings, MCM yields higher $r$ and lower RMSE values for both settings. This demonstrates that the proposed method is a robust choice in various circumstances. It may not be the best performer for any single setting, but it outperforms the others in one way or another.
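The symbolic mean, variance, covariance, and correlation used above translate directly into code. The following is a sketch under the assumed bounds-array layout; with degenerate intervals (lower equal to upper) these statistics reduce to the classical population moments, which makes a convenient sanity check.

```python
# Symbolic sample statistics of Bertrand and Goupil (2000) and Billard (2007, 2008).
import numpy as np

def symbolic_mean(low, up):
    """Symbolic sample mean of an interval-valued variable."""
    return (low + up).sum() / (2 * len(low))

def symbolic_var(low, up):
    """Symbolic sample variance S^2."""
    n = len(low)
    return (low**2 + low * up + up**2).sum() / (3 * n) \
        - ((low + up).sum() / (2 * n)) ** 2

def symbolic_cov(lj, uj, lk, uk):
    """Symbolic sample covariance, equation (7)."""
    n = len(lj)
    mj, mk = symbolic_mean(lj, uj), symbolic_mean(lk, uk)
    return (2 * (lj - mj) * (lk - mk) + (lj - mj) * (uk - mk)
            + (uj - mj) * (lk - mk) + 2 * (uj - mj) * (uk - mk)).sum() / (6 * n)

def symbolic_corr(ly, uy, lz, uz):
    """Symbolic correlation r, used as an accuracy measure in Section 4."""
    return symbolic_cov(ly, uy, lz, uz) / np.sqrt(symbolic_var(ly, uy) * symbolic_var(lz, uz))
```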
As explained in Section 3, the most important contribution of the proposed MCM is that it can provide statistical inference for the estimated model in a natural way. Table 7 displays the medians of the overall model significance test statistics, the Z-statistics, and the p-values for the individual significance tests from 100 repetitions, along with standard errors. The MCM provides correct inference results for all cases: it correctly rejects the overall F-test with very small p-values, rejects $H_0: \beta_1 = 0$ and $H_0: \beta_3 = 0$, and does not reject $H_0: \beta_2 = 0$. The p-values for $H_0: \beta_1 = 0$ are higher than those for $H_0: \beta_3 = 0$ since the magnitude of the true $\beta_1$ is closer to zero. Using these statistical inference results, one can identify important variables and further increase prediction accuracy by removing

insignificant variables from the regression model.

We also run another simulation with a null model whose coefficients are all zero under Setting I. Its results are also summarized at the bottom of Table 7. Based on the median p-value of the $Z_F$ test for model significance, we correctly fail to reject the null hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = 0$.

Table 1: Setting I, where $Y^r$ is independent of the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, 0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 2: Setting II, where $Y^r$ is independent of the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, -0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 3: Setting III, where $Y^r$ is linear in the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, 0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 4: Setting IV, where $Y^r$ is linear in the $X^r_j$, and $\boldsymbol{\beta} = (0, 0.3, 0, -0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 5: Setting V, where $Y^r$ is independent of the $X^r_j$ with the bimodal setting, and $\boldsymbol{\beta} = (0, 0.3, 0, 0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 6: Setting VI, where $Y^r$ is independent of the $X^r_j$ with the bimodal setting, and $\boldsymbol{\beta} = (0, 0.3, 0, -0.5)^T$. Medians of estimates and measures are shown with standard errors.

Table 7: Inference results by MCM for Settings I-VI and the null model. Medians of the Z-statistics and p-values are shown with standard errors.

5 Real Examples

In this section, we analyze two real interval-valued data sets to compare the performance of the proposed and existing methods. For each data set, we use the leave-one-out principle to calculate the three measures, $r$, $RMSE_L$, and $RMSE_U$, while using all the observations to obtain the coefficient estimates and the statistical inference from the proposed method.

5.1 Bats Species Data Set

The bats species data set in Xu (2010) is a naturally occurring interval-valued data set that records physical measurements of 21 different species of bats. There are four interval-valued random variables: $X_1$ = head size, $X_2$ = tail length, $X_3$ = forearm length, and $Y$ = weight.

Table 8: The results for the Bats species data.

Table 9: Bats species data: inferences concerning the regression coefficients by MCM.

Table 8 displays the results of the five considered methods. The estimates of the regression coefficients are somewhat similar across all the methods. The over-parametrized BCRM has the lowest RMSEs and the highest $r$, but it does not possess good generalizability when a separate testing data set is given, as shown in Section 4. The MCM, CM, and SCM show similar performance in terms of the three measures, and CRM is the least favorable for this data set. This can be explained by examining the distribution of the ranges of the response variable, which is displayed in Figure 1. It shows a bimodal distribution that substantially affects the range model. This bimodal structure also creates a sampling bias in the leave-one-out procedure, which costs CRM prediction accuracy. The inferior performance of CRM in Settings V and VI in Section 4 confirms this.

Figure 1: Histogram of the range of the response variable for the Bats species data.

Table 9 provides the statistical inference results by MCM. The p-value is 0 for the overall F-test, and thus the null hypothesis, $H_0: \beta_1 = \beta_2 = \beta_3 = 0$, is rejected. From the p-values for the individual regression coefficients, one can conclude that head size has the most significant influence on the weight, and tail length has the least. We investigate the sampling distributions of $\bar{\beta}_j$, $j = 1, 2, 3$, with histograms and Q-Q plots in Figure 2. The estimates are all reasonably normally distributed.

5.2 Blood Pressure Data Set

We illustrate the comparison of the five methods with a second example, the blood pressure data set from Billard and Diday (2000). The blood pressure data set contains 11 observations with three variables: pulse rate, systolic pressure, and diastolic pressure. We exclude the 11th data point from our analysis because it is known that systolic pressure should be higher than diastolic pressure (Xu, 2010). Each value in the data is an interval, since pulse rates and blood pressure values fluctuate considerably. Suppose that $X_1$ = systolic pressure, $X_2$ = diastolic pressure, and $Y$ = pulse rate.

Table 10: The results for the Blood pressure data.

Table 11: Blood pressure data: inferences concerning the regression coefficients by MCM.

The consistency of the estimated coefficients across all the methods found in the previous example is no longer observed for this data set. According to the three measures, SCM shows the best and BCRM the worst performance. However, Table 11 suggests that the model is not significant at $\alpha = .05$, which makes the aforementioned measures uninterpretable. The tests of the individual coefficients, for systolic pressure and diastolic pressure, are consistent with the overall F-test result. The histograms and Q-Q plots of the estimates from 1000 Monte Carlo replications are shown in Figure 3. The systolic pressure and diastolic pressure coefficients are roughly normally distributed, but some outlying cases are identified, suggesting somewhat mild non-normal behavior.

6 Conclusion

In this paper, we propose a new method for fitting a linear regression model to interval-valued data using Monte Carlo simulation. The proposed method randomly generates a large number, say $B$, of single-valued data sets, each of which consists of points randomly chosen within the observed intervals. For each of the $B$ data sets, a classical regression model is fitted by least squares estimation. The final model for the original interval-valued data is then obtained by aggregating the $B$ estimated models. In this way, the proposed method fully utilizes the internal variation in the interval-valued observations. Furthermore, using the sampling distributions obtained from the random sampling process, one can carry out statistical inference on the regression coefficients, such as confidence intervals and hypothesis tests, in a straightforward way.

The backbone idea of this work can be readily extended to other statistical problems such as principal component analysis, classification, and clustering. For example, for binary classification, one can obtain $B$ classification rules $f_1, \ldots, f_B$ based on single-valued data sets generated by the Monte Carlo sampling used in this paper. The class label for a new interval-valued observation can then be predicted as follows: (1) randomly generate a single point from the given interval, apply each $f_j$, $j = 1, \ldots, B$, and record the majority vote among the $B$ predicted labels; (2) repeat the previous step a reasonably large number of times, say $C$. Finally, the predicted label for the new interval is determined by the majority vote among the $C$ predictions.
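The two-stage voting scheme just described can be sketched as follows; representing a fitted rule as a plain callable that returns a label is our own assumption, made to keep the sketch library-free.

```python
# Sketch of the classification extension: apply B trained rules to C random
# points drawn from a new interval observation and take nested majority votes.
import numpy as np

def predict_interval_label(rules, x_low, x_up, C=50, seed=0):
    """Predict the label of a new interval observation by two-stage majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(C):
        x = rng.uniform(x_low, x_up)              # one point from the hypercube
        labels = [f(x) for f in rules]            # apply each of the B rules
        vals, counts = np.unique(labels, return_counts=True)
        votes.append(vals[np.argmax(counts)])     # majority among the B labels
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]                # majority among the C votes
```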
Multi-category classification can be approached in a similar way.

One possible drawback of the proposed approach is that it is computationally intensive: by the nature of Monte Carlo simulation, a larger number of repetitions is always desirable. However, we believe that this disadvantage is outweighed by the method's good properties. First, it relieves the need to develop complex methodologies for interval-valued data. Second, it can be immediately applied to another popular type of symbolic data, namely histogram-valued data. Third, it is flexible in the

sense that one can modify the Monte Carlo resampling scheme to suit specific problems. For example, one may assume a non-uniform distribution within an interval, such as a truncated normal distribution.

References

Alfonso, F., Billard, L., and Diday, E. (2004). Symbolic linear regression with taxonomies. In Banks, D., House, L., McMorris, F., Arabie, P., and Gaul, W., editors, Classification, Clustering and Data Mining Applications. Springer-Verlag, Berlin.

Bertrand, P. and Goupil, F. (2000). Descriptive statistics for symbolic data. In Bock, H.-H. and Diday, E., editors, Analysis of Symbolic Data. Springer-Verlag, Berlin.

Billard, L. (2007). Dependencies and variation components of symbolic interval-valued data. In Brito, P., Cucumel, G., Bertrand, P., and de Carvalho, F., editors, Selected Contributions in Data Analysis and Classification. Springer-Verlag, Berlin.

Billard, L. (2008). Sample covariance functions for complex quantitative data. In World Congress, International Association of Computational Statistics, Yokohama, Japan.

Billard, L. (2011). Brief overview of symbolic data and analytic issues. Statistical Analysis and Data Mining, 4.

Billard, L. and Diday, E. (2000). Regression analysis for interval-valued data. In Kiers, H. A. L., Rasson, J.-P., Groenen, P. J. F., and Schader, M., editors, Data Analysis, Classification, and Related Methods. Springer-Verlag, Berlin.

Billard, L. and Diday, E. (2002). Symbolic regression analysis. In Classification, Clustering and Data Analysis: Proceedings of the 8th Conference of the International Federation of Classification Societies (IFCS 02). Springer, Poland.

Billard, L. and Diday, E. (2003). From the statistics of data to the statistics of knowledge: Symbolic data analysis. Journal of the American Statistical Association, 98.

Billard, L. and Diday, E. (2007). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.

Diday, E. (1995). Probabilist, possibilist and belief object for knowledge analysis. Annals of Operations Research, 55.

Diday, E. (2003). An introduction to symbolic data analysis and the SODAS software. Journal of Symbolic Data Analysis, 7.

Diday, E. and Emilion, R. (1996). Lattices and capacities in analysis of probabilist object. In Diday, E., Lechevallier, Y., and Opitz, O., editors, Studies in Classification.

Diday, E. and Emilion, R. (1998). Capacities and credibilities in analysis of probabilistic objects by histograms and lattices. In Hayashi, C., Ohsumi, N., Yajima, K., Tanaka, Y., Bock, H. H., and Baba, Y., editors, Data Science, Classification, and Related Methods.

Good, P. (2006). Resampling Methods. Birkhauser.

Lima Neto, E., Cordeiro, G., and de Carvalho, F. (2011). Bivariate symbolic regression models for interval-valued variables. Journal of Statistical Computation and Simulation, 81.

Lima Neto, E. and de Carvalho, F. (2008). Center and range method for fitting a linear regression model to symbolic interval data. Computational Statistics and Data Analysis, 52.

Lima Neto, E. and de Carvalho, F. (2010). Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis, 54.

Lima Neto, E. A., Cordeiro, G. M., Carvalho, F. A. T., Anjos, U., and Costa, A. (2009). Bivariate generalized linear model for interval-valued variables. In Proceedings 2009 IEEE International Joint Conference on Neural Networks, volume 1, Atlanta, USA.

Lima Neto, E. A., de Carvalho, F. A. T., and Tenorio, C. P. (2004). Univariate and multivariate linear regression methods to predict interval-valued features. In Lecture Notes in Computer Science, AI 2004: Advances in Artificial Intelligence. Springer-Verlag, Berlin.

Maia, A. and Carvalho, F. D. (2008). Fitting a least absolute deviation regression model on symbolic interval data. In Lecture Notes in Artificial Intelligence: Proceedings of the Ninth Brazilian Symposium on Artificial Intelligence. Springer-Verlag, Berlin.

Noirhomme-Fraiture, M. and Brito, P. (2011). Far beyond the classical data models: symbolic data analysis. Statistical Analysis and Data Mining, 4.

Silva, A., Lima Neto, E. A., and Anjos, U. (2011). A regression model to interval-valued variables based on copula approach. In Proceedings of the 58th World Statistics Congress of the International Statistical Institute, Dublin, Ireland.

Xu, W. (2010). Symbolic Data Analysis: Interval-Valued Data Regression. PhD thesis, University of Georgia.

[Figure 2: Bats species data. Histograms and normal Q-Q plots of the Head, Tail, and Forearm coefficient estimates from 1000 resampling data sets.]

[Figure 3: Blood pressure data. Histograms and normal Q-Q plots of the systolic pressure and diastolic pressure coefficient estimates from 1000 resampling data sets.]


More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance CESIS Electronic Working Paper Series Paper No. 223 A Bootstrap Test for Causality with Endogenous Lag Length Choice - theory and application in finance R. Scott Hacker and Abdulnasser Hatemi-J April 200

More information

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators Thilo Klein University of Cambridge Judge Business School Session 4: Linear regression,

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

STAT440/840: Statistical Computing

STAT440/840: Statistical Computing First Prev Next Last STAT440/840: Statistical Computing Paul Marriott pmarriott@math.uwaterloo.ca MC 6096 February 2, 2005 Page 1 of 41 First Prev Next Last Page 2 of 41 Chapter 3: Data resampling: the

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

1 Mixed effect models and longitudinal data analysis

1 Mixed effect models and longitudinal data analysis 1 Mixed effect models and longitudinal data analysis Mixed effects models provide a flexible approach to any situation where data have a grouping structure which introduces some kind of correlation between

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics Amang S. Sukasih, Mathematica Policy Research, Inc. Donsig Jang, Mathematica Policy Research, Inc. Amang S. Sukasih,

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

y n 1 ( x i x )( y y i n 1 i y 2

y n 1 ( x i x )( y y i n 1 i y 2 STP3 Brief Class Notes Instructor: Ela Jackiewicz Chapter Regression and Correlation In this chapter we will explore the relationship between two quantitative variables, X an Y. We will consider n ordered

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES REVSTAT Statistical Journal Volume 13, Number 3, November 2015, 233 243 MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES Authors: Serpil Aktas Department of

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Discussiones Mathematicae Probability and Statistics 36 206 43 5 doi:0.75/dmps.80 A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Tadeusz Bednarski Wroclaw University e-mail: t.bednarski@prawo.uni.wroc.pl

More information