Outlier detection and variable selection via difference based regression model and penalized regression


Journal of the Korean Data & Information Science Society 2018, 29(3), 815-825
http://dx.doi.org/10.7465/jkdi.2018.29.3.815

InHae Choi 1, Chun Gun Park 2, Kyeong Eun Lee 3
1,3 Department of Statistics, Kyungpook National University
2 Department of Mathematics, Kyonggi University

Received 3 May 2018, revised 22 May 2018, accepted 22 May 2018

This work was extracted from the Master's thesis of InHae Choi at Kyungpook National University, December 2017.
1 Master, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.
2 Associate Professor, Department of Mathematics, Kyonggi University, Suwon 16227, Korea.
3 Corresponding author: Associate Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: artlee@knu.ac.kr

Abstract

This paper studies an efficient procedure for outlier detection and variable selection in linear regression. The effect of an outlier enters the linear regression as a mean shift parameter, a nonzero or zero constant. To fit this mean shift model, most penalized regressions place adaptive penalties on these parameters so that most of them are shrunk to zero. Such penalized models select the true variables well, but they do not detect the outliers correctly. To overcome this problem, we first determine a group of suspected outliers using the difference-based regression model (DBRM) and add this group to the linear model through a parameter for the effect of each suspected outlier. We then perform outlier detection and variable selection simultaneously, applying Lasso regression or Elastic net regression to the linear model augmented with the effect term of each suspected outlier. The proposed method is more efficient than the previous penalized regressions. We compare the proposed procedure with other methods in a simulation study and apply it to real data.

Keywords: Difference-based regression model, Elastic net, Lasso, outlier detection, variable selection.

1. Introduction

Outliers are observations that differ markedly from the others and occur frequently in real data. They distort statistical inference; for instance, the ordinary least squares estimator is very susceptible to outliers (Joo and Cho, 2016). To address this, many studies have considered outlier detection or robust techniques. We consider the mean shift linear regression

    y = β_0 1_n + X_1 β_1 + X_2 β_2 + γ + ε,

where X_1 is an n x p_1 design matrix of relevant predictors, X_2 is an n x p_2 design matrix of irrelevant predictors, β_a is the parameter vector corresponding to X_a, γ is the effect vector of outliers, whose entries are zero or a nonzero constant, and ε is an error vector (McCann and Welsch, 2007; She and Owen, 2011; Park and Kim, 2017).
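As a concrete illustration of this mean shift model, one data set could be generated as sketched below. This is illustrative only; the sample size, coefficients and outlier magnitude are arbitrary choices, not the settings of the simulation study in Section 3.

    # Illustrative only: simulate one data set from the mean-shift model
    # y = beta0*1 + X1*beta1 + X2*beta2 + gamma + eps.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p1, p2 = 100, 3, 3                  # sample size, relevant / irrelevant predictors
    X1 = rng.normal(size=(n, p1))          # relevant predictors
    X2 = rng.normal(size=(n, p2))          # irrelevant predictors
    beta0 = 1.0
    beta1 = np.array([2.0, -1.5, 1.0])     # nonzero coefficients
    beta2 = np.zeros(p2)                   # irrelevant predictors contribute nothing

    gamma = np.zeros(n)                    # outlier effects: zero for clean cases
    outlier_idx = rng.choice(n, size=int(0.1 * n), replace=False)
    gamma[outlier_idx] = 8 * rng.choice([-1, 1], size=outlier_idx.size)

    eps = rng.normal(size=n)
    y = beta0 + X1 @ beta1 + X2 @ beta2 + gamma + eps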

In this study, we focus on an efficient procedure for outlier detection and variable selection in the above mean shift linear regression. Penalized regression methods such as the Lasso (Tibshirani, 1996) and Elastic net regression (Zou and Hastie, 2005) have recently become popular for variable selection, and penalized regressions of this type can also be used for outlier detection. For example, She and Owen (2011) proposed a nonconvex penalty function on the γ_i's, the effects of the outliers. Such penalties select the true variables well, but they do not detect the outliers correctly. To overcome this problem, we first determine a group of suspected outliers using the difference-based regression model (DBRM) (Park and Kim, 2017; Park, 2018) and add this group to the linear model through a parameter for the effect of each suspected outlier. We then perform outlier detection and variable selection simultaneously, applying Lasso regression or Elastic net regression to the linear model augmented with the effect term of each suspected outlier. The proposed method is more efficient than the previous penalized regressions.

In the literature, some methods, such as the forward search algorithm (Hadi and Simonoff, 1993), detect outliers directly without estimating the mean function, while others, such as least trimmed squares (LTS) (Rousseeuw and Leroy, 1987) and the thresholding-based iterative procedure (Θ-IPOD) (She and Owen, 2011), identify outliers indirectly through the residuals of a robust regression fit. Whereas the latter methods detect outliers by estimating the mean trend function, our outlier detection method uses the difference-based regression model without estimating the mean trend function.

The rest of this article is organized as follows. In Section 2, we introduce the difference-based regression model, which is used to pick out a group of suspected outliers, together with Lasso regression and Elastic net regression for selecting important variables, and we describe an efficient procedure for outlier detection and variable selection via the two penalized regressions and the DBRM. In Section 3, we conduct a simulation study to compare our approach with other existing methods. In Section 4, we apply the proposed method to fuel consumption data. Finally, in Section 5, we provide some conclusions.

2. Methods

In this section, we briefly review the DBRM (Park and Kim, 2017), Lasso regression (Tibshirani, 1996) and Elastic net regression (Zou and Hastie, 2005), and then propose our method for outlier detection and variable selection using the DBRM with Lasso or Elastic net regression.

2.1. Difference-based regression model

Consider a multiple linear regression model without outliers,

    y = 1_n β_0 + X_1 β_1 + X_2 β_2 + ε,

where β_0 is an intercept, 1_n = [1, 1, ..., 1]^T is the n-dimensional unit vector, X_a is an n x p_a matrix of rank p_a, β_a = [β_{a1}, ..., β_{a p_a}]^T is the p_a-vector of coefficients, a = 1, 2, E(ε) = 0 and Var(ε) = σ^2 R with unknown correlation matrix R and unknown variance σ^2.

For detecting outliers in the linear regression model, the mean-shift model (Cook and Weisberg, 1982) has long been used. We therefore consider the following mean shift model:

    w = y + γ = 1_n β_0 + X β + γ + ε,

where X = [X_1, X_2], β = [β_1^T, β_2^T]^T and p = p_1 + p_2. The DBRM detects outliers using the following property: when the jth case is an outlier, the absolute value of the estimated intercept of the model fitted without the jth case is large. The DBRM can be written as

    D^(j) w = D^(j) y + D^(j) γ
            = 1_{n-1} (-γ_j) + D^(j) X β + A^(j) γ + D^(j) ε
            = [1_{n-1} : D^(j) X] [ -γ_j ; β ] + A^(j) γ + D^(j) ε,   j = 1, ..., n,     (2.1)

where (-γ_j) is the intercept used for detecting outliers and D^(j) and A^(j) are the (n-1) x n matrices

    D^(j) = [ I_{j-1}           -1_{j-1}   0_{(j-1),(n-j)} ]        A^(j) = [ I_{j-1}           0_{j-1}   0_{(j-1),(n-j)} ]
            [ 0_{(n-j),(j-1)}   -1_{n-j}   I_{n-j}         ],               [ 0_{(n-j),(j-1)}   0_{n-j}   I_{n-j}         ],

where I_a is an a x a identity matrix and 0_{a,b} is an a x b null matrix. Since the intercept (-γ_j) is highly influenced by outliers, we propose to determine outlier candidates using the intercepts estimated from the difference-based regression model. Note that we are not interested in the other parameters.

2.2. Lasso regression

Tibshirani (1996) proposed the Lasso regression model for regression shrinkage and selection. For any fixed non-negative λ, the Lasso estimate β̂_L is defined by

    β̂_L = argmin_β L(λ, β)
        = argmin_β Σ_{i=1}^n ( y_i - β_0 - Σ_{j=1}^p β_j x_{ij} )^2 + λ Σ_{j=1}^p |β_j|.     (2.2)

The larger λ is, the closer β_1, ..., β_p are to zero; that is, the Lasso shrinks the coefficient estimates. However, the Lasso has some known limitations. First, in the high-dimensional setting it selects at most n variables before it saturates. Second, if there is a group of highly correlated variables, the Lasso tends to select one variable from the group and ignore the others (Zou and Hastie, 2005).
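To make the shrinkage behaviour of (2.2) concrete, a small sketch using scikit-learn's Lasso follows. Note that scikit-learn minimizes (1/(2n))·RSS + α‖β‖_1, so α plays the role of λ only up to a constant factor; the data below are arbitrary and only for illustration.

    # Lasso shrinkage: as the penalty grows, more coefficients are set exactly
    # to zero.  The data here are arbitrary; only the qualitative behaviour of
    # the penalty is of interest.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 6))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

    for alpha in [0.01, 0.1, 0.5, 1.0]:
        fit = Lasso(alpha=alpha).fit(X, y)
        print(alpha, np.round(fit.coef_, 2))   # coefficients shrink toward zero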

2.3. Elastic net regression

Zou and Hastie (2005) proposed the Elastic net for regularization and variable selection to overcome these limitations of the Lasso. For any fixed non-negative λ_1 and λ_2, the Elastic net estimate β̂_E is defined by

    β̂_E = argmin_β L(λ_1, λ_2, β)
        = argmin_β Σ_{i=1}^n ( y_i - β_0 - Σ_{j=1}^p β_j x_{ij} )^2 + λ_1 Σ_{j=1}^p β_j^2 + λ_2 Σ_{j=1}^p |β_j|.     (2.3)

The Elastic net allows sparsity and the grouping effect together (Zou and Hastie, 2005).

2.4. Algorithm

We propose to use the DBRM to select outlier candidates and then to perform outlier detection and variable selection simultaneously using Lasso regression or the Elastic net. This approach consists of the following steps.

Step 1. Estimation of the intercepts in the DBRM,

    D^(j) w = 1_{n-1} (-γ_j) + D^(j) X β + A^(j) γ + D^(j) ε.

  1.1. Estimate the intercepts γ̂_j, j = 1, ..., n, and set δ_j = |γ̂_j|.
  1.2. Sort them in ascending order: δ_(1) ≤ ... ≤ δ_(n).

Step 2. Determination of outlier candidates.

  2.1. Set the percentage of outlier candidates.
  2.2. Take the cases with the largest δ_(j) values as outlier candidates.
  2.3. Include an indicator variable for each outlier candidate in the multiple linear model.

Step 3. Outlier detection and variable selection using the Lasso or Elastic net in the augmented model

    w = X β + γ_1 z_1 + ... + γ_k z_k + ε,

where z_l is the indicator vector of the l-th outlier candidate, l = 1, ..., k.

  3.1. Find the Lasso estimates by minimizing

    Σ_{i=1}^n ( w_i - β_0 - Σ_{j=1}^p β_j x_{ij} - Σ_{j=1}^k γ_j z_{ij} )^2 + λ ( Σ_{j=1}^p |β_j| + Σ_{j=1}^k |γ_j| ).

  3.2. Find the Elastic net estimates by minimizing

    Σ_{i=1}^n ( w_i - β_0 - Σ_{j=1}^p β_j x_{ij} - Σ_{j=1}^k γ_j z_{ij} )^2 + λ_1 ( Σ_{j=1}^p β_j^2 + Σ_{j=1}^k γ_j^2 ) + λ_2 ( Σ_{j=1}^p |β_j| + Σ_{j=1}^k |γ_j| ).

In Step 3, we need to set the optimal λ for the Lasso regression and Elastic net, so we use a method that obtains the optimal λ in a multiple regression model without outliers.
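The following sketch puts Steps 1-3 together. Several choices that the paper leaves open are assumptions here: the intercepts in Step 1 are estimated by ordinary least squares with the remaining outlier term A^(j)γ absorbed into the error, the candidate percentage is fixed at 30% as in Section 3.1, and the penalty parameters are chosen by cross-validation through scikit-learn's LassoCV / ElasticNetCV (l1_ratio left at its default). The function and variable names are ours, for illustration only.

    # A sketch of the proposed procedure (Steps 1-3) under the assumptions
    # stated above; it is not the authors' implementation.
    import numpy as np
    from sklearn.linear_model import LassoCV, ElasticNetCV

    def dbrm_intercepts(w, X):
        """Step 1: delta_j = |estimated intercept| of the j-th differenced model (2.1)."""
        n = len(w)
        delta = np.empty(n)
        for j in range(n):
            keep = np.arange(n) != j
            Dw = w[keep] - w[j]                        # D^(j) w
            DX = X[keep] - X[j]                        # D^(j) X
            Z = np.column_stack([np.ones(n - 1), DX])  # [1_{n-1} : D^(j) X]
            coef, *_ = np.linalg.lstsq(Z, Dw, rcond=None)
            delta[j] = abs(coef[0])                    # |estimate of (-gamma_j)|
        return delta

    def dbrm_penalized_fit(w, X, candidate_frac=0.3, use_enet=False):
        """Steps 2-3: candidate indicators plus Lasso / Elastic net on [X : Z]."""
        n, p = X.shape
        delta = dbrm_intercepts(w, X)

        # Step 2: the cases with the largest delta_j become outlier candidates,
        # each represented by a 0/1 indicator column.
        k = int(np.ceil(candidate_frac * n))
        candidates = np.argsort(delta)[-k:]
        Z = np.zeros((n, k))
        Z[candidates, np.arange(k)] = 1.0

        # Step 3: penalized regression on the augmented design; nonzero gamma
        # coefficients flag outliers, nonzero beta coefficients are the
        # selected variables.
        model = ElasticNetCV(cv=5) if use_enet else LassoCV(cv=5)
        model.fit(np.column_stack([X, Z]), w)
        beta_hat, gamma_hat = model.coef_[:p], model.coef_[p:]
        return np.flatnonzero(beta_hat), candidates[np.flatnonzero(gamma_hat)]

For example, with the simulated data sketched in Section 1, calling dbrm_penalized_fit(y, np.column_stack([X1, X2])) would return the indices of the selected columns and of the flagged observations.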

3. Simulation studies

We conduct simulations to compare the performance of our method with other existing methods. We compare outlier detection with the following four methods and compare variable selection with LAD-Lasso:

- least trimmed squares (LTS) (Rousseeuw and Leroy, 1987),
- LAD regression with the Lasso penalty (LAD-Lasso) (Wang et al., 2007),
- the hard-thresholding-based iterative procedure for outlier detection (HIPOD) (She and Owen, 2011),
- the soft-thresholding-based iterative procedure for outlier detection (SIPOD) (She and Owen, 2011).

3.1. Simulation setting

We consider three sample sizes (n = 30, 100, 300), each with 10% outliers and p = n/10 covariates (p_1 = p_2 = p/2). Covariates are generated from a multivariate normal distribution with mean 0 and covariance matrix Σ = {σ_ij}, σ_ij = ρ^|i-j|. We consider two levels of correlation between covariates (ρ = 0, 0.5) and two error distributions (N(0, 1) and t with 2 degrees of freedom), so the total number of simulation cases is 12. Outlier locations and signs are selected at random, and the outlier sizes are 8 or 10. We generate 100 data sets for each case and set the size of the outlier candidate group to 30% of the sample size.

3.2. Criteria

Let n_O be the number of true outliers, n_D the number of detected outliers, n_CD the number of correctly detected outliers, n_ID the number of incorrectly detected outliers, n_IU the number of incorrectly undetected outliers and n_CU the number of correctly undetected non-outliers.

Table 3.1 Outlier detection

  Detection \ True     Outlier   Non-outlier   Sum
  Outlier              n_CD      n_ID          n_D
  Non-outlier          n_IU      n_CU          n_UD
  Sum                  n_O       n_NO          n

Two kinds of errors can occur in outlier detection: masking (a true outlier is not detected) and swamping (a non-outlier is detected as an outlier). To show the strength of the proposed method, we define the relative frequencies of perfect detection, only swamping, only masking, both masking and swamping, and complete failure over the S simulated data sets as follows (a small sketch that tabulates these detection frequencies is given at the end of this subsection):

- Relative frequency of perfect detection:
    PD = (1/S) Σ_{s=1}^S I( n_CD(s) = n_O(s) ).
- Relative frequency of only swamping with detection (overdetection):
    OS = (1/S) Σ_{s=1}^S I( n_IU(s) = 0, n_ID(s) > 0, n_CD(s) > 0 ).

- Relative frequency of only masking with partial detection:
    OM = (1/S) Σ_{s=1}^S I( n_IU(s) > 0, n_ID(s) = 0, n_CD(s) > 0 ).
- Relative frequency of masking and swamping with partial detection:
    MS = (1/S) Σ_{s=1}^S I( n_IU(s) > 0, n_ID(s) > 0, n_CD(s) > 0 ).
- Relative frequency of complete failure:
    CF = (1/S) Σ_{s=1}^S I( n_CD(s) = 0 ).

Finally, we use three criteria for variable selection: CS, the relative frequency of correct selection; CR, the relative frequency of correct variable reduction; and AN, the average number of selected variables:

- CS, the relative frequency of correct selection:
    CS = (1/S) Σ_{s=1}^S I( {j : β̂_{j,s} ≠ 0} = {j : β_j ≠ 0} ).
- CR, the relative frequency of correct variable reduction:
    CR = (1/S) Σ_{s=1}^S I( {j : β̂_{j,s} ≠ 0} ⊇ {j : β_j ≠ 0} ).
- AN, the average number of selected variables:
    AN = (1/S) Σ_{s=1}^S #{j : β̂_{j,s} ≠ 0}.
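The sketch below tabulates the five detection frequencies from the sets of true and detected outlier indices of each simulation run. One assumption is made here: PD is counted only when there is neither masking nor swamping, so that the five categories partition the runs (the text above defines PD through n_CD = n_O alone).

    # Tabulation of PD, OS, OM, MS and CF over simulation runs; a sketch under
    # the assumption stated above, not the authors' code.
    def detection_frequencies(runs):
        """runs: list of (true_outlier_indices, detected_indices) pairs."""
        S = len(runs)
        freq = {"PD": 0, "OS": 0, "OM": 0, "MS": 0, "CF": 0}
        for true, det in runs:
            true, det = set(true), set(det)
            n_cd = len(true & det)   # correctly detected outliers
            n_id = len(det - true)   # non-outliers detected as outliers (swamping)
            n_iu = len(true - det)   # outliers not detected (masking)
            if n_cd == 0:
                freq["CF"] += 1      # complete failure
            elif n_iu == 0 and n_id == 0:
                freq["PD"] += 1      # perfect detection
            elif n_iu == 0:
                freq["OS"] += 1      # only swamping
            elif n_id == 0:
                freq["OM"] += 1      # only masking
            else:
                freq["MS"] += 1      # masking and swamping
        return {k: v / S for k, v in freq.items()}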

3.3. Simulation results

We show the results of the 12 simulation cases. Table 3.2 reports the outlier detection results. METHOD1 combines the difference-based regression model with Lasso regression, and METHOD2 combines the difference-based regression model with Elastic net regression. Regardless of sample size or the other conditions, our method shows comparable performance, mainly in detection power, and the larger the sample size and the correlation between covariates, the better it detects outliers. Although our method's performance is not superior to the other methods', its overall performance is comparable.

Next, Table 3.3 reports the variable selection results. Our method obtains results comparable with the other existing methods in the simulation study and is superior in most cases. Although the number of selected variables is larger than the number of important variables, the rate of correct variable reduction is much better than that of the other methods.

Table 3.2 Comparison of outlier detection
(entries are relative frequencies; columns: LTS, LAD-Lasso, HIPOD, SIPOD, METHOD1, METHOD2)

ε ~ N(0,1), ρ = 0, n = 30
  PD   0  0  0.02  0.46  0.29  0.17
  OS   0  0  0.98  0.53  0.70  0.83
  OM   0  0  0     0.01  0.01  0
ε ~ N(0,1), ρ = 0, n = 100
  PD   0  0  0.02  0.15  0     0
  OS   0  0  0.98  0.85  1     1
  OM   0  0  0     0.01  0.01  0
ε ~ N(0,1), ρ = 0, n = 300
  PD   0  0  0.03  0.02  0     0
  OS   0  0  0.97  0.98  1     1
  OM   0  0  0     0     0     0
ε ~ N(0,1), ρ = 0.5, n = 30
  PD   0  0  0.02  0.46  0.18  0.11
  OS   0  0  0.98  0.53  0.81  0.88
  OM   0  0  0     0.01  0.01  0.01
ε ~ N(0,1), ρ = 0.5, n = 100
  PD   0  0  0.02  0.15  0.21  0.11
  OS   0  0  0.98  0.85  0.79  0.89
  OM   0  0  0     0     0     0
ε ~ N(0,1), ρ = 0.5, n = 300
  PD   0  0  0.03  0.02  0.13  0.08
  OS   0  0  0.97  0.98  0.87  0.92
  OM   0  0  0     0     0     0
ε ~ t(2), ρ = 0, n = 30
  PD   0  0  0.01  0.09  0.06  0.05
  OS   0  0  0.99  0.85  0.90  0.93
  OM   0  0  0     0.01  0.01  0
  MS   0  0  0     0     0.03  0.02
  CF   1  1  0     0.05  0     0
ε ~ t(2), ρ = 0, n = 100
  PD   0  0  0     0     0     0
  OS   0  0  1     0.81  0.92  0.92
  OM   0  0  0     0.02  0.03  0.01
  MS   0  0  0     0     0.03  0.06
  CF   1  1  0     0.16  0.02  0.01
ε ~ t(2), ρ = 0, n = 300
  PD   0  0  0     0     0     0
  OS   0  0  1     0.37  0.93  0.93
  OM   0  0  0     0.05  0     0
  MS   0  0  0     0     0.05  0.05
  CF   1  1  0     0.58  0.02  0.02

Table 3.2 (continued)
(columns: LTS, LAD-Lasso, HIPOD, SIPOD, METHOD1, METHOD2)

ε ~ t(2), ρ = 0.5, n = 30
  PD   0  0  0.01  0.09  0.05  0.01
  OS   0  0  0.99  0.85  0.92  0.94
  OM   0  0  0     0.01  0.01  0.01
  MS   0  0  0     0     0.02  0.04
  CF   1  1  0     0.05  0     0
ε ~ t(2), ρ = 0.5, n = 100
  PD   0  0  0     0     0     0
  OS   0  0  1     0.81  0.96  0.97
  OM   0  0  0     0.02  0.01  0.01
  MS   0  0  0     0.01  0.02  0.01
  CF   1  1  0     0.16  0.01  0.01
ε ~ t(2), ρ = 0.5, n = 300
  PD   0  0  0     0     0     0
  OS   0  0  1     0.37  0.92  0.93
  OM   0  0  0     0.05  0     0
  MS   0  0  0     0     0.06  0.05
  CF   1  1  0     0.58  0.02  0.02

Table 3.3 Comparison of variable selection
(columns: LAD-Lasso, METHOD1, METHOD2)

ε ~ N(0,1), ρ = 0, n = 30
  CS   0.46  0.40  0.48
  CR   0.59  0.68  0.80
  AN   2.76  1.92  2.1
ε ~ N(0,1), ρ = 0, n = 100
  CS   0.44  0.61  0.47
  CR   0.53  0.94  0.96
  AN   1.73  2.28  2.47
ε ~ N(0,1), ρ = 0, n = 300
  CS   0.44  0.61  0.47
  CR   0.53  0.94  0.96
  AN   1.73  2.28  2.47
ε ~ N(0,1), ρ = 0.5, n = 30
  CS   0.44  0.61  0.47
  CR   0.53  0.94  0.96
  AN   1.73  2.28  2.47
ε ~ N(0,1), ρ = 0.5, n = 100
  CS   0.44  0.61  0.47
  CR   0.53  0.94  0.96
  AN   1.73  2.28  2.47
ε ~ N(0,1), ρ = 0.5, n = 300
  CS   0.44  0.61  0.47
  CR   0.53  0.94  0.96
  AN   1.73  2.28  2.47
ε ~ t(2), ρ = 0, n = 30
  CS   0.34  0.31  0.3
  CR   0.44  0.59  0.63
  AN   1.62  1.76  1.89
ε ~ t(2), ρ = 0, n = 100
  CS   0.31  0.39  0.33
  CR   0.42  0.80  0.85
  AN   1.68  2.25  2.37
ε ~ t(2), ρ = 0, n = 300
  CS   0.31  0.39  0.33
  CR   0.42  0.80  0.85
  AN   1.68  2.25  2.37

Table 3.3 (continued)
(columns: LAD-Lasso, METHOD1, METHOD2)

ε ~ t(2), ρ = 0.5, n = 30
  CS   0.31  0.39  0.33
  CR   0.42  0.80  0.85
  AN   1.68  2.25  2.37
ε ~ t(2), ρ = 0.5, n = 100
  CS   0.31  0.39  0.33
  CR   0.42  0.80  0.85
  AN   1.68  2.25  2.37
ε ~ t(2), ρ = 0.5, n = 300
  CS   0.31  0.39  0.33
  CR   0.42  0.80  0.85
  AN   1.68  2.25  2.37

4. Real data analysis

We apply our procedure to fuel consumption data. The consumption of fuel was measured in 48 states, so there are 48 rows of data. The response variable (Y) and the four explanatory variables (X_1, X_2, X_3 and X_4) are as follows:

- Y: consumption of fuel (millions of gallons),
- X_1: fuel tax (cents per gallon),
- X_2: average income (dollars),
- X_3: paved highways (miles),
- X_4: proportion of the population with driver's licenses.

[Figure 4.1 Residual plot]

Looking at Figure 4.1, we see that the 40th observation is an outlier. Table 4.1 gives the outlier detection results for the fuel consumption data. Among the methods, HIPOD, SIPOD, METHOD1 and METHOD2 identify the 40th observation as an outlier, but HIPOD, METHOD1 and METHOD2 tend to detect too many outliers.

Table 4.1 The result of outlier detection in fuel consumption data

  Method      Number of outliers   Outlier index set
  LTS         0                    { }
  LAD-Lasso   0                    { }
  HIPOD       22                   {5, 8, 9, 11, 15, 16, 18, 19, 20, 22, 23, 24, 31, 33, 34, 38, 39, 40, 42, 43, 44, 45}
  SIPOD       2                    {18, 40}
  METHOD1     10                   {5, 11, 18, 19, 33, 38, 40, 42, 44, 45}
  METHOD2     14                   {5, 11, 15, 18, 19, 20, 33, 36, 38, 39, 40, 42, 44, 45}

Table 4.2 gives the variable selection results for the fuel consumption data. LAD-Lasso and METHOD1 regard X_3 as the unimportant variable and select three variables, whereas METHOD2 does not find the unimportant variable and selects all four variables.

Table 4.2 The result of variable selection in fuel consumption data

  Method      Number of selected variables   Unselected variable
  LAD-Lasso   3                              X_3
  METHOD1     3                              X_3
  METHOD2     4                              -
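A hypothetical usage sketch for this analysis follows, reusing the dbrm_penalized_fit helper sketched in Section 2.4. The file name and column names are placeholders of our own; the paper does not specify its data source.

    # Hypothetical application to the fuel consumption data; "fuel.csv" and the
    # column names are placeholders, not an actual file shipped with the paper.
    import numpy as np
    import pandas as pd

    fuel = pd.read_csv("fuel.csv")                    # 48 rows, one per state
    X = fuel[["X1", "X2", "X3", "X4"]].to_numpy()     # tax, income, highways, licenses
    y = fuel["Y"].to_numpy()                          # fuel consumption

    selected, outliers = dbrm_penalized_fit(y, X, candidate_frac=0.3)
    print("selected variables:", selected)            # compare with Table 4.2
    print("flagged outliers:", outliers)              # compare with Table 4.1 (0-based indices)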

5. Conclusion

In this paper, using the properties of the intercept estimator in a difference-based regression model, we propose a simultaneous procedure for outlier detection and variable selection that is more efficient and simpler than carrying out each procedure separately. Our procedure does not require estimating the mean function. Instead, we determine the candidate group of outliers using the intercept estimates from the difference-based regression model and add the candidate group to the model; we then perform outlier detection and variable selection using Lasso regression and Elastic net regression. Our procedure obtains results comparable with other existing methods in the simulation study and is superior in most cases, and we apply it to real data. In future work, the approach can be extended to nonparametric regression models for outlier detection.

References

Cook, R. D. and Weisberg, S. (1982). Residuals and influence in regression, Chapman & Hall, London, UK.
Hadi, A. S. and Simonoff, J. S. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264-1272.
Joo, Y. S. and Cho, G-Y. (2016). Outlier detection and treatment in industrial sampling survey. Journal of the Korean Data & Information Science Society, 27, 131-142.

McCann, L. and Welsch, R. E. (2007). Robust variable selection using least angle regression and elemental set sampling. Computational Statistics & Data Analysis, 52, 249-257.
Park, C. G. (2018). Distinction of an outlier(s) using difference based regression models. Journal of the Korean Data & Information Science Society, 29, 339-350.
Park, C. G. and Kim, I. (2017). Outlier detection using difference based regression model. Manuscript.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust regression and outlier detection, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons, New York.
She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106, 626-639.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.
Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business & Economic Statistics, 25, 347-355.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the Elastic net. Journal of the Royal Statistical Society, Series B, 67, 301-320.