Journal of the Korean Data & Information Science Society 2018, 29(3), 815-825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지

Outlier detection and variable selection via difference based regression model and penalized regression

InHae Choi 1, Chun Gun Park 2, Kyeong Eun Lee 3
1,3 Department of Statistics, Kyungpook National University
2 Department of Mathematics, Kyonggi University

Received 3 May 2018, revised 22 May 2018, accepted 22 May 2018

Abstract

This paper studies an efficient procedure for the outlier detection and variable selection problem in linear regression. The effect of each outlier is added to the linear regression as a mean shift parameter, a nonzero or zero constant. To fit this mean shift model, most penalized regressions place adaptive penalties on the parameters to shrink most of them to zero. Such penalized models select the true variables well but do not detect the outliers correctly. To overcome this problem, we first determine a group of possibly suspected outliers using the difference-based regression model (DBRM) and add the group to the linear model through a parameter for the effect of each suspected outlier. Then we perform outlier detection and variable selection simultaneously using Lasso regression or Elastic net regression on the linear regression augmented with these effect terms. The proposed method is more efficient than the previous penalized regressions. We compare the proposed procedure with other methods in a simulation study and apply the procedure to real data.

Keywords: Difference-based regression model, Elastic net, Lasso, outlier detection, variable selection.

1. Introduction

Outliers are observations that differ significantly from the others and frequently occur in the collection of real data. They distort statistical inference; for instance, the ordinary least squares estimator is very susceptible to outliers (Joo and Cho, 2016).
To address this, many methods for outlier detection and robust estimation have been studied. We consider the mean shift linear regression

$y = \beta_0 \mathbf{1}_n + X_1\beta_1 + X_2\beta_2 + \gamma + \epsilon,$

where $X_1$ is an $n \times p_1$ design matrix of relevant predictors, $X_2$ is an $n \times p_2$ design matrix of irrelevant predictors, $\beta_i$ is the parameter vector corresponding to $X_i$, and $\gamma$ is the outlier-effect vector, whose entries are zero or nonzero

This work was extracted from the Master's thesis of InHae Choi at Kyungpook National University in December 2017.
1 Master, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.
2 Associate Professor, Department of Mathematics, Kyonggi University, Suwon 16227, Korea.
3 Corresponding author: Associate Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: artlee@knu.ac.kr
constant, and $\epsilon$ is an error vector (McCann and Welsch, 2007; She and Owen, 2011; Park and Kim, 2017). In this study, we focus on an efficient procedure for outlier detection and variable selection in the above mean shift linear regression. Recently, penalized regression methods such as the Lasso (Tibshirani, 1996) and Elastic net regression (Zou and Hastie, 2005) have become popular for variable selection. A variant of these penalized regressions can also be used for outlier detection. For example, She and Owen (2011) proposed a nonconvex penalty function on the $\gamma_j$'s, the outlier effects. Such penalties select the true variables well but do not detect the outliers correctly. To overcome this problem, we first determine a group of possibly suspected outliers using the difference-based regression model (DBRM) (Park and Kim, 2017; Park, 2018) and add the group to the linear model through a parameter for the effect of each suspected outlier. Then we perform outlier detection and variable selection simultaneously using Lasso regression or Elastic net regression on the linear regression augmented with these effect terms. The proposed method is more efficient than the previous penalized regressions. In the literature, some methods such as the forward search algorithm (Hadi and Simonoff, 1993) detect outliers directly without estimating the mean function, while others such as least trimmed squares (LTS) (Rousseeuw and Leroy, 1987) and the thresholding-based iterative procedure (Θ-IPOD) (She and Owen, 2011) identify outliers indirectly using residuals from a robust regression model. While the latter methods detect outliers by estimating the mean trend function, our outlier detection method uses the difference-based regression model without estimating the mean trend function. The rest of this article is organized as follows.
In Section 2, we introduce the difference-based regression model, which is used to pick out a group of possibly suspected outliers, and the Lasso and Elastic net regressions for selecting important variables. We also describe an efficient procedure for outlier detection and variable selection via the two penalized regressions and DBRM. In Section 3, we conduct a simulation study to compare our approach with other existing methods. In Section 4, we apply the proposed method to fuel consumption data. Finally, in Section 5, we provide some conclusions.

2. Methods

In this section, we briefly review DBRM (Park and Kim, 2017), Lasso regression (Tibshirani, 1996) and Elastic net regression (Zou and Hastie, 2005), and then propose our method for outlier detection and variable selection using DBRM together with Lasso or Elastic net regression.

2.1. Difference based regression model

Consider a multiple linear regression model without outliers,

$y = \mathbf{1}_n\beta_0 + X_1\beta_1 + X_2\beta_2 + \epsilon,$

where $\beta_0$ is an intercept, $\mathbf{1}_n = [1, 1, \dots, 1]^T$ is the $n$-dimensional unit vector, $X_a$ is an $n \times p_a$ matrix of rank $p_a$, $\beta_a = [\beta_{a1}, \dots, \beta_{ap_a}]^T$ is the $p_a$-vector of coefficients, $a = 1, 2$, $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2 R$ with unknown correlation matrix $R$ and unknown variance $\sigma^2$.
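As a concrete illustration of this model, the following sketch generates data with correlated covariates in the style of the simulation study of Section 3; all dimensions and coefficient values here are assumed for illustration, not taken from the paper.

```python
# Illustrative data generation for y = 1_n*b0 + X1*b1 + X2*b2 + e,
# with AR(1)-correlated covariates (Sigma_{ij} = rho^|i-j|).
# All numeric values below are assumptions for the sketch.
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, rho = 100, 3, 3, 0.5
p = p1 + p2

# AR(1) covariance for the covariates.
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X1, X2 = X[:, :p1], X[:, p1:]

beta0 = 1.0
beta1 = np.array([2.0, -1.5, 1.0])          # relevant coefficients (assumed)
y = beta0 + X1 @ beta1 + rng.normal(size=n)  # X2 is irrelevant (beta2 = 0)
```

In this sketch the correlation matrix $R$ of the errors is taken to be the identity; the model above allows a general unknown $R$.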
For detecting outliers in the linear regression model, the mean-shift model (Cook and Weisberg, 1982) has been used for a long time. So we consider the following mean shift model:

$w = y + \gamma = \mathbf{1}_n\beta_0 + X\beta + \gamma + \epsilon,$

where $X = [X_1, X_2]$, $\beta = [\beta_1^T, \beta_2^T]^T$ and $p = p_1 + p_2$. DBRM detects outliers using the following property: when the $j$th case is an outlier, the absolute value of the estimated intercept of the model differenced with respect to the $j$th case is large. The DBRM can be written as

$D^{(j)}w = D^{(j)}y + D^{(j)}\gamma = \mathbf{1}_{n-1}(-\gamma_j) + D^{(j)}X\beta + A^{(j)}\gamma + D^{(j)}\epsilon = [\mathbf{1}_{n-1} : D^{(j)}X]\begin{bmatrix} -\gamma_j \\ \beta \end{bmatrix} + A^{(j)}\gamma + D^{(j)}\epsilon, \quad j = 1, \dots, n, \qquad (2.1)$

where $(-\gamma_j)$ is the intercept used for detecting outliers and $D^{(j)}$ and $A^{(j)}$ are the $(n-1) \times n$ matrices

$D^{(j)} = \begin{bmatrix} I_{j-1} & -\mathbf{1}_{j-1} & O_{(j-1),(n-j)} \\ O_{(n-j),(j-1)} & -\mathbf{1}_{n-j} & I_{n-j} \end{bmatrix}, \qquad A^{(j)} = \begin{bmatrix} I_{j-1} & \mathbf{0}_{j-1} & O_{(j-1),(n-j)} \\ O_{(n-j),(j-1)} & \mathbf{0}_{n-j} & I_{n-j} \end{bmatrix},$

where $I_a$ is an $a \times a$ identity matrix and $O_{a,b}$ is an $a \times b$ null matrix. Since the intercept $(-\gamma_j)$ is highly influenced by outliers, we propose to determine outlier candidates using the intercepts estimated from the difference-based regression model. Note that we are not interested in the other parameters.

2.2. Lasso regression

Tibshirani (1996) proposed the Lasso regression model for regression shrinkage and selection. For any fixed non-negative $\lambda$, the Lasso estimate $\hat{\beta}_L$ is defined by

$\hat{\beta}_L = \arg\min_{\beta} L(\lambda, \beta) = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|. \qquad (2.2)$

Here, the larger $\lambda$ is, the closer $\beta_1, \dots, \beta_p$ are to zero; that is, the Lasso regression shrinks the coefficient estimates. But it is known to have some limitations. First, in the high-dimensional case the Lasso method selects at most $n$ variables before it saturates. Second, if there is a group of highly correlated variables, the Lasso method tends to select one variable from the group and ignore the others (Zou and Hastie, 2005).

2.3.
Elastic net regression

Zou and Hastie (2005) proposed the Elastic net for regularization and variable selection to overcome these limitations of the Lasso. For any fixed non-negative $\lambda_1$ and $\lambda_2$, the Elastic net
estimate $\hat{\beta}_E$ is defined by

$\hat{\beta}_E = \arg\min_{\beta} L(\lambda_1, \lambda_2, \beta) = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda_1\sum_{j=1}^{p}\beta_j^2 + \lambda_2\sum_{j=1}^{p}|\beta_j|. \qquad (2.3)$

The Elastic net method allows sparsity and a grouping effect together (Zou and Hastie, 2005).

2.4. Algorithm

We propose to use the DBRM to select outlier candidates and then to perform outlier detection and variable selection simultaneously using the Lasso or Elastic net method. This approach consists of the following steps:

Step 1. Estimation of the intercepts in DBRM,
$D^{(j)}w = \mathbf{1}_{n-1}(-\gamma_j) + D^{(j)}X\beta + A^{(j)}\gamma + D^{(j)}\epsilon.$
1.1. Estimate the intercepts $-\hat{\gamma}_j$, $j = 1, \dots, n$, and let $\delta_j = |\hat{\gamma}_j|$.
1.2. Sort them in ascending order: $\delta_{(1)} \le \cdots \le \delta_{(n)}$.

Step 2. Determination of outlier candidates.
2.1. Set the percentage of outlier candidates.
2.2. Take the cases with the largest $\delta_{(j)}$ values as outlier candidates.
2.3. Include an indicator variable for each outlier candidate in the multiple linear model.

Step 3. Outlier detection and variable selection using the Lasso or Elastic net on
$w = \mathbf{1}_n\beta_0 + X\beta + \gamma_1 z_1 + \cdots + \gamma_k z_k + \epsilon,$
where $z_l$ is the indicator vector of the $l$th outlier candidate.
3.1. Find the Lasso estimates by minimizing
$\sum_{i=1}^{n}\Big(w_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij} - \sum_{j=1}^{k}\gamma_j z_{ij}\Big)^2 + \lambda\Big(\sum_{j=1}^{p}|\beta_j| + \sum_{j=1}^{k}|\gamma_j|\Big).$
3.2. Find the Elastic net estimates by minimizing
$\sum_{i=1}^{n}\Big(w_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij} - \sum_{j=1}^{k}\gamma_j z_{ij}\Big)^2 + \lambda_1\Big(\sum_{j=1}^{p}\beta_j^2 + \sum_{j=1}^{k}\gamma_j^2\Big) + \lambda_2\Big(\sum_{j=1}^{p}|\beta_j| + \sum_{j=1}^{k}|\gamma_j|\Big).$

In Step 3, we need the optimal $\lambda$ for the Lasso regression and Elastic net, so we select it in the same way as in a multiple regression model without outliers.
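The three steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the use of scikit-learn's LassoCV/ElasticNetCV to choose the penalty by cross-validation, and the default 30% candidate fraction, are our assumptions.

```python
# Sketch of the DBRM + penalized-regression procedure (illustrative only).
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

def dbrm_intercepts(X, w):
    """Step 1: |estimated intercept| of the DBRM for each left-out case j."""
    n, p = X.shape
    deltas = np.empty(n)
    for j in range(n):
        D = np.delete(np.eye(n), j, axis=0)   # drop row j of I_n ...
        D[:, j] = -1.0                        # ... subtract case j: D^{(j)}
        Z = np.column_stack([np.ones(n - 1), D @ X])  # [1_{n-1} : D^{(j)} X]
        coef, *_ = np.linalg.lstsq(Z, D @ w, rcond=None)
        deltas[j] = abs(coef[0])              # delta_j = |intercept estimate|
    return deltas

def dbrm_penalized(X, w, cand_frac=0.3, use_enet=False, seed=0):
    """Steps 2-3: candidates from DBRM, then Lasso or Elastic net fit."""
    n, p = X.shape
    deltas = dbrm_intercepts(X, w)
    k = int(np.ceil(cand_frac * n))
    cand = np.argsort(deltas)[-k:]            # cases with largest delta_j
    Zc = np.zeros((n, k))                     # one indicator column per
    Zc[cand, np.arange(k)] = 1.0              # outlier candidate
    Xa = np.column_stack([X, Zc])
    model = (ElasticNetCV(cv=5, random_state=seed) if use_enet
             else LassoCV(cv=5, random_state=seed))
    model.fit(Xa, w)                          # lambda chosen by CV (assumed)
    beta, gamma = model.coef_[:p], model.coef_[p:]
    selected = np.flatnonzero(beta)           # selected variables
    outliers = cand[np.flatnonzero(gamma)]    # detected outliers
    return selected, outliers
```

Passing `use_enet=True` switches Step 3 from the Lasso to the Elastic net fit; everything else is shared, which mirrors the structure of the algorithm.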
3. Simulation studies

We conduct simulations to compare the performance of our method with other existing methods. We compare the outlier detection with the following four methods and compare the variable selection with LAD-Lasso:

- Least trimmed squares (LTS) (Rousseeuw and Leroy, 1987),
- LAD regression with the Lasso penalty (LAD-Lasso) (Wang et al., 2007),
- Hard-thresholding based iterative procedure for outlier detection (HIPOD) (She and Owen, 2011),
- Soft-thresholding based iterative procedure for outlier detection (SIPOD) (She and Owen, 2011).

3.1. Simulation setting

We consider three sample sizes (n = 30, 100, 300), each with 10% outliers and p = n/10 covariates ($p_1 = p_2 = p/2$). Covariates are generated from a multivariate normal distribution with mean 0 and covariance matrix $\Sigma = \{\sigma_{ij}\}$, $\sigma_{ij} = \rho^{|i-j|}$. We consider two correlation levels for the covariates (ρ = 0, 0.5) and two error distributions (N(0, 1) and t with df = 2), so the total number of simulation cases is 12. Outlier locations and signs are randomly selected and the outlier sizes are 8 or 10. We generate 100 data sets for each case. We set the size of the outlier candidate group to 30% of the sample size.

3.2. Criteria

Let $n_O$ be the number of true outliers, $n_D$ the number of detected outliers, $n_{CD}$ the number of correctly detected outliers, $n_{ID}$ the number of incorrectly detected outliers, $n_{IU}$ the number of incorrectly undetected outliers and $n_{CU}$ the number of correctly undetected non-outliers.

Table 3.1 Outlier detection

                           True
  Detection      Outlier   Non-outlier   Sum
  Outlier        n_CD      n_ID          n_D
  Non-outlier    n_IU      n_CU          n_UD
  Sum            n_O       n_NO          n

Two kinds of errors can occur in outlier detection: masking (a true outlier is not detected) and swamping (a non-outlier is detected as an outlier).
We define the relative frequencies of perfect detection, only masking, only swamping, masking and swamping, and complete failure over the S simulated data sets in order to show the strength of our proposed method, as follows:

- Relative frequency of perfect detection: $PD = \frac{1}{S}\sum_{s=1}^{S} I\big(n_{CD(s)} = n_{O(s)}\big)$.

- Relative frequency of only-swamping with detection (overdetection): $OS = \frac{1}{S}\sum_{s=1}^{S} I\big(n_{IU(s)} = 0,\ n_{ID(s)} > 0,\ n_{CD(s)} > 0\big)$.
- Relative frequency of only-masking with partial detection: $OM = \frac{1}{S}\sum_{s=1}^{S} I\big(n_{IU(s)} > 0,\ n_{ID(s)} = 0,\ n_{CD(s)} > 0\big)$.

- Relative frequency of masking and swamping with partial detection: $MS = \frac{1}{S}\sum_{s=1}^{S} I\big(n_{IU(s)} > 0,\ n_{ID(s)} > 0,\ n_{CD(s)} > 0\big)$.

- Relative frequency of complete failure: $CF = \frac{1}{S}\sum_{s=1}^{S} I\big(n_{CD(s)} = 0\big)$.

Finally, we use three criteria for variable selection, also computed over the S simulated data sets: the relative frequency of correct selection (CS), the relative frequency of correct variable reduction (CR) and the average number of selected variables (AN):

- CS, the relative frequency of correct selection: $CS = \frac{1}{S}\sum_{s=1}^{S} I\big(\{j : \hat{\beta}_{j,s} \neq 0\} = \{j : \beta_j \neq 0\}\big)$.

- CR, the relative frequency of correct variable reduction: $CR = \frac{1}{S}\sum_{s=1}^{S} I\big(\{j : \hat{\beta}_{j,s} \neq 0\} \supseteq \{j : \beta_j \neq 0\}\big)$.

- AN, the average number of selected variables: $AN = \frac{1}{S}\sum_{s=1}^{S} \#\{j : \hat{\beta}_{j,s} \neq 0\}$.

3.3. Simulation result

We show the results of the 12 simulation cases. Table 3.2 gives the results of outlier detection. METHOD1 combines the difference-based regression model with the Lasso regression and METHOD2 combines the difference-based regression model with the Elastic net regression. Regardless of sample size or other conditions, our method shows comparable performance, mainly in detection power, and the larger the sample size and the correlation between variables, the better the outlier detection performance. Although our performance is not uniformly superior to the other methods, the overall performance is comparable. Next, we show the results of variable selection in Table 3.3.
Our method obtains results comparable with the other existing methods in the simulation studies, and it is superior in most cases. Although the number of selected variables is larger than the number of important variables, the rate of correct variable reduction is much better than that of the other methods.

Table 3.2 Comparison of outlier detection

ε        ρ       n      Crit.  LTS  LAD-Lasso  HIPOD  SIPOD  METHOD1  METHOD2
N(0,1)   ρ=0     n=30   PD     0    0          0.02   0.46   0.29     0.17
                        OS     0    0          0.98   0.53   0.70     0.83
                        OM     0    0          0      0.01   0.01     0
                 n=100  PD     0    0          0.02   0.15   0        0
                        OS     0    0          0.98   0.85   1        1
                        OM     0    0          0      0.01   0.01     0
                 n=300  PD     0    0          0.03   0.02   0        0
                        OS     0    0          0.97   0.98   1        1
                        OM     0    0          0      0      0        0
         ρ=0.5   n=30   PD     0    0          0.02   0.46   0.18     0.11
                        OS     0    0          0.98   0.53   0.81     0.88
                        OM     0    0          0      0.01   0.01     0.01
                 n=100  PD     0    0          0.02   0.15   0.21     0.11
                        OS     0    0          0.98   0.85   0.79     0.89
                        OM     0    0          0      0      0        0
                 n=300  PD     0    0          0.03   0.02   0.13     0.08
                        OS     0    0          0.97   0.98   0.87     0.92
                        OM     0    0          0      0      0        0
t(df=2)  ρ=0     n=30   PD     0    0          0.01   0.09   0.06     0.05
                        OS     0    0          0.99   0.85   0.90     0.93
                        OM     0    0          0      0.01   0.01     0
                        MS     0    0          0      0      0.03     0.02
                        CF     1    1          0      0.05   0        0
                 n=100  PD     0    0          0      0      0        0
                        OS     0    0          1      0.81   0.92     0.92
                        OM     0    0          0      0.02   0.03     0.01
                        MS     0    0          0      0      0.03     0.06
                        CF     1    1          0      0.16   0.02     0.01
                 n=300  PD     0    0          0      0      0        0
                        OS     0    0          1      0.37   0.93     0.93
                        OM     0    0          0      0.05   0        0
                        MS     0    0          0      0      0.05     0.05
                        CF     1    1          0      0.58   0.02     0.02
Table 3.2 Continued

ε        ρ       n      Crit.  LTS  LAD-Lasso  HIPOD  SIPOD  METHOD1  METHOD2
t(df=2)  ρ=0.5   n=30   PD     0    0          0.01   0.09   0.05     0.01
                        OS     0    0          0.99   0.85   0.92     0.94
                        OM     0    0          0      0.01   0.01     0.01
                        MS     0    0          0      0      0.02     0.04
                        CF     1    1          0      0.05   0        0
                 n=100  PD     0    0          0      0      0        0
                        OS     0    0          1      0.81   0.96     0.97
                        OM     0    0          0      0.02   0.01     0.01
                        MS     0    0          0      0.01   0.02     0.01
                        CF     1    1          0      0.16   0.01     0.01
                 n=300  PD     0    0          0      0      0        0
                        OS     0    0          1      0.37   0.92     0.93
                        OM     0    0          0      0.05   0        0
                        MS     0    0          0      0      0.06     0.05
                        CF     1    1          0      0.58   0.02     0.02

Table 3.3 Comparison of variable selection

ε        ρ       n      Crit.  LAD-Lasso  METHOD1  METHOD2
N(0,1)   ρ=0     n=30   CS     0.46       0.40     0.48
                        CR     0.59       0.68     0.80
                        AN     2.76       1.92     2.1
                 n=100  CS     0.44       0.61     0.47
                        CR     0.53       0.94     0.96
                        AN     1.73       2.28     2.47
                 n=300  CS     0.44       0.61     0.47
                        CR     0.53       0.94     0.96
                        AN     1.73       2.28     2.47
         ρ=0.5   n=30   CS     0.44       0.61     0.47
                        CR     0.53       0.94     0.96
                        AN     1.73       2.28     2.47
                 n=100  CS     0.44       0.61     0.47
                        CR     0.53       0.94     0.96
                        AN     1.73       2.28     2.47
                 n=300  CS     0.44       0.61     0.47
                        CR     0.53       0.94     0.96
                        AN     1.73       2.28     2.47
t(df=2)  ρ=0     n=30   CS     0.34       0.31     0.3
                        CR     0.44       0.59     0.63
                        AN     1.62       1.76     1.89
                 n=100  CS     0.31       0.39     0.33
                        CR     0.42       0.80     0.85
                        AN     1.68       2.25     2.37
                 n=300  CS     0.31       0.39     0.33
                        CR     0.42       0.80     0.85
                        AN     1.68       2.25     2.37
Table 3.3 Continued

ε        ρ       n      Crit.  LAD-Lasso  METHOD1  METHOD2
t(df=2)  ρ=0.5   n=30   CS     0.31       0.39     0.33
                        CR     0.42       0.80     0.85
                        AN     1.68       2.25     2.37
                 n=100  CS     0.31       0.39     0.33
                        CR     0.42       0.80     0.85
                        AN     1.68       2.25     2.37
                 n=300  CS     0.31       0.39     0.33
                        CR     0.42       0.80     0.85
                        AN     1.68       2.25     2.37

4. Real data analysis

We apply our procedure to fuel consumption data. Because the consumption of fuel was measured in 48 states, there are 48 rows of data. The response variable (Y) and the four explanatory variables (X1, X2, X3 and X4) are as follows:

- Y = consumption of fuel (millions of gallons).
- X1 = fuel tax (cents per gallon).
- X2 = average income (dollars).
- X3 = paved highways (miles).
- X4 = proportion of population with driver's licenses.

Looking at Figure 4.1, we see that the 40th observation is an outlier. Table 4.1 gives the results of outlier detection for the fuel consumption data. Among the methods, HIPOD, SIPOD, METHOD1 and METHOD2 identify the 40th observation as an outlier, but HIPOD, METHOD1 and METHOD2 tend to detect too many outliers.

Table 4.1 The result of outlier detection in fuel consumption data

Method     Number of outliers  Outlier index set
LTS        0                   { }
LAD-Lasso  0                   { }
HIPOD      22                  {5, 8, 9, 11, 15, 16, 18, 19, 20, 22, 23, 24, 31, 33, 34, 38, 39, 40, 42, 43, 44, 45}
SIPOD      2                   {18, 40}
METHOD1    10                  {5, 11, 18, 19, 33, 38, 40, 42, 44, 45}
METHOD2    14                  {5, 11, 15, 18, 19, 20, 33, 36, 38, 39, 40, 42, 44, 45}

Table 4.2 gives the results of variable selection for the fuel consumption data. LAD-Lasso and METHOD1 consider X3 to be the unimportant variable and select 3 variables, while METHOD2 does not find the unimportant variable and selects all 4 variables.
Figure 4.1 Residual plot

Table 4.2 The result of variable selection in fuel consumption data

Method     Number of selected variables  Unselected variable
LAD-Lasso  3                             X3
METHOD1    3                             X3
METHOD2    4                             -

5. Conclusion

In this paper, using the properties of the intercept estimator in a difference-based regression model, we propose a simultaneous procedure for outlier detection and variable selection which is more efficient and simpler than carrying out each procedure separately. Our procedure does not require estimating the mean function. Instead, we determine the candidate group of outliers using the intercept estimates from the difference-based regression model and add the candidate group to the model. Then we perform outlier detection and variable selection using Lasso regression and Elastic net regression. Our procedure obtains results comparable with other existing methods in the simulation studies, and it is superior in most cases. We also apply our procedure to real data. For future study, the procedure can be extended to the nonparametric regression model for outlier detection.

References

Cook, R. D. and Weisberg, S. (1982). Residuals and influence in regression, Chapman & Hall, London, UK.

Hadi, A. S. and Simonoff, J. S. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264-1272.

Joo, Y. S. and Cho, G-Y. (2016). Outlier detection and treatment in industrial sampling survey. Journal of the Korean Data & Information Science Society, 27, 131-142.
McCann, L. and Welsch, R. E. (2007). Robust variable selection using least angle regression and elemental set sampling. Computational Statistics & Data Analysis, 52, 249-257.

Park, C. G. (2018). Distinction of an outlier(s) using difference based regression models. Journal of the Korean Data & Information Science Society, 29, 339-350.

Park, C. G. and Kim, I. (2017). Outlier detection using difference based regression model. Manuscript.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust regression and outlier detection, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons, New York.

She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106, 626-639.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business & Economic Statistics, 25, 347-355.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the Elastic net. Journal of the Royal Statistical Society, Series B, 67, 301-320.