# IDENTIFYING MULTIPLE OUTLIERS IN LINEAR REGRESSION: ROBUST FIT AND CLUSTERING APPROACH


SESSION X: THEORY OF DEFORMATION ANALYSIS II

Robiah Adnan¹, Halim Setan², Mohd Nor Mohamad³

¹ Faculty of Science, Universiti Teknologi Malaysia (UTM); ² Faculty of Geoinformation Science and Engineering, UTM; ³ Faculty of Science, UTM

## Abstract

This research provides a clustering-based approach for determining potential candidates for outliers. It is a modification of the method proposed by Serbert et al. (1998), and is based on using the single linkage clustering algorithm to group the standardized predicted and residual values of a data set fitted by least trimmed squares (LTS).

## 1. Introduction

It is customary to model a response variable mathematically as a function of regressors using linear regression analysis. Linear regression analysis is one of the most important and widely used statistical techniques in engineering, science, and management. The linear regression model can be expressed in matrix terms as

y = Xβ + ε,

where y is the n×1 vector of observed response values, X is the n×p matrix of p regressors (the design matrix), β is the p×1 vector of regression coefficients, and ε is the n×1 vector of error terms. The goal of regression analysis is to estimate the unknown regression coefficients β from the observed sample. The most widely used technique for estimating β is the method of ordinary least squares (OLS). By the Gauss-Markov theorem, least squares is always the best linear unbiased estimator (BLUE), and if ε is normally distributed with mean 0 and variance σ²I, least squares is the uniformly minimum variance unbiased estimator. Inference procedures such as hypothesis tests, confidence intervals, and prediction intervals are powerful under the assumption that the error term ε is normal with mean 0 and variance σ²I.
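The OLS estimate described above can be sketched numerically. The following is a minimal illustration on synthetic data (all names and values here are hypothetical, not the paper's):

```python
import numpy as np

# Hypothetical data, not from the paper: y = X beta + eps, eps ~ N(0, sigma^2 I).
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0.0, 20.0, n)
X = np.column_stack([np.ones(n), x])        # n x 2 design matrix (intercept + one regressor)
beta_true = np.array([1.0, 5.0])
y = X @ beta_true + rng.normal(0.0, 1.0, n)

# OLS: beta_hat = (X'X)^{-1} X'y, computed via a numerically stable solver.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat                # ordinary residuals used by LS diagnostics
```

With clean, normally distributed errors the estimate lands close to the true coefficients; the point developed below is that gross outliers can move this estimate badly.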
Violation of this assumption can distort the fit of the regression model, and consequently the parameter estimates and inferences can be flawed. Violation of the NID assumption on the error term can occur when there are one or more outliers in the data set. According to Barnett and Lewis (1994), an outlier is an observation that is inconsistent with the rest of the data. Such outliers can have a profound destructive influence on the statistical analysis. An observation can be an outlier in several different ways. In the linear regression model, outliers are grouped into three classes: 1) residual or regression outliers, whose response values are very different from the values fitted to the uncontaminated data,

The 10th FIG International Symposium on Deformation Measurements, 19-22 March 2001, Orange, California, USA

2) leverage outliers, whose regressor variable values are extreme in X-space, and 3) observations that are both residual and leverage outliers.

Many standard least squares regression diagnostics can identify a single outlier or a few outliers, but these techniques have been shown to fail in the presence of multiple outliers. They perform poorly in that setting because of the swamping and masking effects: swamping occurs when clean observations are falsely identified as outliers, while masking occurs when true outliers are not identified. There are two general approaches to handling outliers. The first is to identify the outliers and to remove or edit them from the data set. The second is to accommodate the outliers by using a robust procedure that diminishes their influence on the statistical analysis. The procedure proposed in this paper is of the robust kind, since it uses the least trimmed squares (LTS) fit instead of the least squares (LS) fit.

## 2. Method

The methodology used in this paper is a modification of the technique of Serbert et al. (1998). It is well known that least trimmed squares (LTS) is a high-breakdown estimator, where the breakdown point is defined as the smallest fraction of unusual data that can render the estimator useless. Figure 1 shows that LTS produces a better regression line than LS in the presence of outliers.

Figure 1. Results of LS vs LTS: the OLS fit with outliers (y = 7.868x) is pulled toward the outlying group, while the LTS fit (y = 4.963x − .93) stays close to the clean data.

The proposed method uses the standardized predicted and residual values from the least trimmed squares (LTS) fit rather than from the least squares (LS) fit used by Serbert et al. (1998). LTS regression was proposed by Rousseeuw and Leroy (1987).
This method is similar to least squares but, instead of using all the squared residuals, it minimizes the sum of the h smallest squared residuals. The objective function is

minimize over β̂:   Σ_{i=1}^{h} (r²)_{i:n},

where (r²)_{1:n} ≤ … ≤ (r²)_{n:n} are the ordered squared residuals. It is solved either by random resampling (Rousseeuw and Leroy, 1987), by a genetic algorithm (Burns, 1992), which is the method used in S-Plus, or by forward search (Woodruff and Rocke, 1994). In this paper, the genetic algorithm (Burns, 1992) was used to solve the LTS objective function. Putting h = [n/2] + [(p + 1)/2], LTS reaches the maximal possible breakdown point.

We assume that the clean observations form a linear relationship, that is, a horizontal chain-like cluster in the plot of predicted versus residual values. Therefore, to identify a clean base subset, the single linkage algorithm is used, since it is the best technique for identifying elongated clusters. Virtually all clustering procedures provide little information on the number of clusters present in the data; hierarchical methods routinely produce a series of solutions ranging from n clusters down to a single cluster. A procedure is therefore needed to decide how many clusters actually exist in the data set, that is, the cluster tree must be cut at a certain height, and the number of clusters depends on the height of the cut. When such a procedure is applied to the results of hierarchical clustering, it is sometimes referred to as a stopping rule. The stopping rule used in this paper was proposed by Mojena (1977); it resembles a one-tailed confidence interval based on the n − 1 heights (joining distances) of the cluster tree.

The proposed methodology can be summarized as follows:

1. Standardize the predicted and residual values from the least trimmed squares (LTS) fit.
2. Use the Euclidean distance (rather than the Mahalanobis distance) between pairs of the standardized predicted and residual values from step 1 as the similarity measure, since the predicted values and the residuals are known to be uncorrelated.
   It is assumed that the observations that are not outliers will generally have a linear relationship; from a clustering point of view, one is looking for a long, horizontal, chain-like cluster, which can be identified successfully by the single linkage clustering algorithm.
3. Form clusters based on the tree height (ch, a measure of closeness) using Mojena's stopping rule, ch = h̄ + k·s_h, where h̄ is the average height of the tree, s_h is the sample standard deviation of the heights, and k is a specified constant. Milligan and Cooper (1985), in a comprehensive study, conclude that the best overall performance is obtained when k is set to 1.25.
4. The clean data set is the largest cluster formed that includes the median; the other clusters are considered to contain the potential outliers.

To illustrate the proposed methodology, the wood specific gravity data set from Draper and Smith (1966) is considered. A comparison between the method of Serbert et al. (1998) and the proposed method, referred to as Method 1, is demonstrated in the next example.

## 3. Example

A real data set from Draper and Smith (1966, p. 227), with 20 observations and 5 regressor variables, is used. The data set was modified by Rousseeuw and Leroy (1987) so that observations 4, 6, 8, and 19 are xy-space outlying (Table 1).
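Step 1 requires an LTS fit. The paper solves LTS with Burns' genetic algorithm in S-Plus; as a rough stand-in, the sketch below uses naive random elemental resampling (a hypothetical simplification, not the authors' implementation) to minimize the sum of the h smallest squared residuals:

```python
import numpy as np

def lts_fit(X, y, n_trials=500, seed=0):
    """Crude LTS sketch: fit exact-fit elemental subsets at random and keep the
    candidate minimizing the sum of the h smallest squared residuals.
    (The paper itself solves the LTS objective with a genetic algorithm.)"""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = n // 2 + (p + 1) // 2             # choice of h giving maximal breakdown
    best_beta, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])    # exact fit to p points
        except np.linalg.LinAlgError:
            continue                                   # skip singular subsets
        crit = np.sort((y - X @ beta) ** 2)[:h].sum()  # h smallest squared residuals
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    return best_beta

# Contaminated simple regression: clean line y = 5x, plus four gross y-outliers.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 20.0, 40)
X = np.column_stack([np.ones(40), x])
y = 5.0 * x + rng.normal(0.0, 1.0, 40)
y[:4] += 80.0
beta_lts = lts_fit(X, y)
```

The LTS slope stays near the clean slope despite the contamination, which is why the predicted/residual plot built from it separates the outliers instead of masking them.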

Table 1. Sample data with 4 outliers (columns: Obs no, X1, X2, X3, X4, X5, Y; values not reproduced here).

Table 2 gives the standardized predicted and residual values for the method of Serbert et al. (1998) and for the proposed method (Method 1).

Table 2. Results: Method 1 vs Serbert (standardized predicted values and residuals for each observation; values not reproduced here).

Table 3 shows the clusters produced by Serbert's method, where the dendrogram (Figure 2) is cut at a height of .9577 using Mojena's stopping rule. Three clusters are formed, and the largest cluster, cluster 1, is considered to contain the

clean observations, while cluster 2 and cluster 3 contain the potential outliers. The potential outliers are observations 4, 6, 7, 8, 11, and 19. The swamping effect is evident, since clean observations 7 and 11 are flagged as outliers.

Table 3. Clusters: Serbert's method (observation numbers and cluster assignments; values not reproduced here).

Figure 2. Dendrogram for Serbert's method.

Table 4 shows the clusters formed using the proposed method (Method 1), where the dendrogram (Figure 3) is cut at a height of .9656 based on Mojena's stopping rule. Two clusters are formed, with cluster 2 containing the clean observations and cluster 1 containing the potential outliers. This method correctly identifies all the outliers.

Table 4. Clusters: Method 1 (observation numbers and cluster assignments; values not reproduced here).

From the results obtained, Method 1 is successful in flagging all the outliers. Serbert's method also flags all the outliers, but it wrongly flags observations 7 and 11 as well, so the swamping effect is present when Serbert's method is used.

A simulation study was conducted to test the performance of the proposed method (Method 1) in various outlier scenarios and regression conditions. Since this methodology is a modification of Serbert's method, it is logical to compare the two methods using the scenarios proposed by Serbert et al. (1998).
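Steps 2 through 4 of the method (Euclidean distances, single linkage tree, Mojena cut, largest cluster taken as the clean set) can be sketched as follows. The predicted and residual values here are synthetic stand-ins for the output of an LTS fit, and k = 1.25 follows the Milligan and Cooper recommendation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-ins for LTS predicted values and residuals (hypothetical data:
# 18 clean observations plus 2 gross residual outliers at indices 18 and 19).
rng = np.random.default_rng(2)
pred = rng.normal(0.0, 1.0, 20)
resid = np.concatenate([rng.normal(0.0, 1.0, 18), rng.normal(8.0, 0.5, 2)])

# Steps 1-2: standardize, then use Euclidean distances between (pred, resid) pairs.
Z = np.column_stack([(pred - pred.mean()) / pred.std(),
                     (resid - resid.mean()) / resid.std()])

# Step 3: single linkage tree, cut at Mojena's height ch = hbar + k * s_h.
tree = linkage(Z, method="single")       # column 2 holds the joining heights
heights = tree[:, 2]
ch = heights.mean() + 1.25 * heights.std(ddof=1)
labels = fcluster(tree, t=ch, criterion="distance")

# Step 4: the largest cluster is taken as the clean base subset; the rest are
# flagged as potential outliers.
clean_label = np.bincount(labels).argmax()
outliers = np.flatnonzero(labels != clean_label)
```

Single linkage is chosen deliberately: its chaining behavior, usually a drawback, is exactly right for picking out the long horizontal band of clean points on the predicted/residual plot.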

Figure 3. Dendrogram for Method 1.

## 4. Simulation study

There are 6 outlier scenarios and a total of 24 different outlier situations (each scenario consists of 4 situations, differing in the percentage of outliers, the distance from the clean observations, and the number of regressors) considered in this research. Figure 4 shows a simple regression picture of the six outlier scenarios chosen. For each simulation, the value of the regression coefficients was set to 5. The distribution of the regressor variables for the clean observations was uniform U(0, 20). The distribution of the error term for both the clean and the outlying observations was normal N(0, 1). The six scenarios are as follows. In scenario 1, there is one group of xy-space outlying observations with regressor variable values of approximately 20. In scenario 2, there is also one xy-space outlying group, but with regressor variable values of approximately 30. In scenario 3, there are two xy-space outlying groups, one with regressor variable values of approximately 20 and the other with regressor variable values of approximately 10. Scenario 4 is the same as scenario 3, except that the regressor variable values are approximately 10 and 30. In scenario 5, there is one x-space outlying group with regressor variable values of approximately 30. In the last scenario, scenario 6, there is one x-space outlying group and one xy-space outlying group, both with regressor variable values of approximately 30. The factors considered in this simulation are similar to those of the classic data sets found in the literature. The levels for the percentage of outliers are 10% and 20%. The number of outlying groups is one or two. The number of regressors is 1, 3, or 6.
The distances for the outliers were selected to be 5 and 10 standard deviations. The simulation scenarios place one or two clusters of several observations at a specified location, shifted in X-space (outlying in the regressor variables only), shifted in Y-space (outlying in the response variable only), or shifted in XY-space. The primary measures of performance are: (1) the probability that an outlying observation is detected (tppo), and (2) the probability that a known clean observation is identified as an outlier (tpswamp), i.e., the occurrence of a false alarm. Code development and simulations were done using S-Plus for Windows version 4.5.
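One cell of this simulation design can be sketched as follows. This is a hypothetical reconstruction of scenario 1 from the description above (clean regressors U(0, 20), N(0, 1) errors, one xy-space outlying group near x = 20 shifted a fixed number of error standard deviations off the line); the function name and its defaults are illustrative, not the authors' code:

```python
import numpy as np

def make_scenario_1(n=20, pct_out=0.10, shift_sd=5.0, seed=0):
    """Hypothetical sketch of one simulation cell (simple regression, p = 1):
    clean data on y = 5x with N(0, 1) errors, plus one xy-space outlying group
    with regressor values near 20, shifted `shift_sd` error-sds off the line."""
    rng = np.random.default_rng(seed)
    n_out = int(round(pct_out * n))
    x_clean = rng.uniform(0.0, 20.0, n - n_out)
    y_clean = 5.0 * x_clean + rng.normal(0.0, 1.0, n - n_out)
    x_out = rng.normal(20.0, 0.5, n_out)                          # x-values near 20
    y_out = 5.0 * x_out + shift_sd + rng.normal(0.0, 1.0, n_out)  # shifted off the line
    x = np.concatenate([x_clean, x_out])
    y = np.concatenate([y_clean, y_out])
    is_outlier = np.r_[np.zeros(n - n_out, bool), np.ones(n_out, bool)]
    return x, y, is_outlier

x, y, is_outlier = make_scenario_1()
```

tppo and tpswamp are then estimated by running a detection method over many such replicates and comparing its flags against `is_outlier`.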

Figure 4. The six outlier scenarios.

### Results of simulations

Figure 5. Comparison of performance between Method 1 and Serbert's method for n = 20 and p = 1 (tppo and tpswamp for each method).

Figure 6. Comparison of performance between Method 1 and Serbert's method for n = 40 and p = 1 (tppo and tpswamp for each method).

Figure 7. Comparison of performance between Method 1 and Serbert's method for n = 20 and p = 3 (tppo and tpswamp for each method).

Figure 8. Comparison of performance between Method 1 and Serbert's method for n = 40 and p = 3 (tppo and tpswamp for each method).

Figure 5 shows that the detection probability (tppo) for the proposed method (Method 1) is better than Serbert's in every scenario except in situations 21 through 24 (scenario 6 in Figure 4), where there are two outlying groups, one of x-space outlying observations and the other of xy-space outlying observations. The probability of swamping (tpswamp) is also lower for Method 1 than for Serbert's method, that is, a lower probability of false alarm. However, both methods have difficulty detecting outliers in situations 3, 9, 10, 15, and 21 through 24. Figure 6 illustrates that the detection probability improves significantly for both methods when the sample size increases from n = 20 to n = 40, but Method 1 still outperforms Serbert's in situations 3 and 11, where the percentage of outliers is 20% and the distance of the outliers from the clean data is 5 sigma.

Figure 9. Comparison of performance between Method 1 and Serbert's method for n = 20 and p = 6 (tppo and tpswamp for each method).

Figure 10. Comparison of performance between Method 1 and Serbert's method for n = 40 and p = 6 (tppo and tpswamp for each method).

Figures 7 and 8 show that both methods perform much better at detecting outliers when the number of regressors increases from p = 1 to p = 3. For n = 20, both methods have difficulty detecting outliers in situations 3 and 11, but Method 1 still outperforms Serbert's in those situations. As n increases from 20 to 40, both methods are equally good in every situation, and Method 1 has a smaller probability of swamping in most situations. Figures 9 and 10 illustrate the performance of both methods when p = 6. The detection probability for Method 1 is significantly lower than Serbert's in situations 3, 11, and 21 when the sample size is 20, but improves when n is increased from 20 to 40. For n = 20, the probability of swamping is greater for Method 1 than for Serbert's in situations 3, 7, 8, 10, 15, 16, 17, 19, and 21.

## 5. Conclusions

Many authors have suggested using a clustering-based approach for detecting multiple outliers. Among others, Gray and Ling (1984) propose a clustering algorithm to identify potentially influential subsets. Hadi and Simonoff (1993) propose a clustering strategy for identifying multiple outliers in linear regression by applying single linkage clustering to Z = [X y]. As mentioned earlier, Serbert et al. (1998) also propose a clustering-based method, applying single linkage clustering to the residuals and associated predicted values from a least squares fit. The

proposed method, Method 1, is a modification of Serbert's, replacing the least squares fit with a more robust fit, the least trimmed squares (LTS). From the simulation results obtained, Method 1 is better than Serbert's at detecting outliers in situations where the number of regressors p is less than or equal to 3 for small n (n = 20). The success of Method 1 in detecting outliers improves significantly with increasing sample size, becoming as competitive as Serbert's even in situations where the number of regressors p is large. It is hoped that other clustering algorithms, perhaps a non-hierarchical algorithm such as k-means, can also be used effectively for identifying multiple outliers.

## References

Barnett, V. and Lewis, T. (1994), Outliers in Statistical Data, 3rd ed., John Wiley, Great Britain.

Burns, P.J. (1992), A genetic algorithm for robust regression estimation, StatSci Technical Note, Seattle, WA.

Draper, N.R. and Smith, H. (1966), Applied Regression Analysis, John Wiley and Sons, New York.

Gray, J.B. and Ling, R.F. (1984), K-clustering as a detection tool for influential subsets in regression, Technometrics, 26.

Hadi, A.S. and Simonoff, J.S. (1993), Procedures for the identification of multiple outliers in linear models, Journal of the American Statistical Association, 88.

Hocking, R.R. (1996), Methods and Applications of Linear Models: Regression and Analysis of Variance, John Wiley, New York.

Mojena, R. (1977), Hierarchical grouping methods and stopping rules: An evaluation, The Computer Journal, 20.

Milligan, G.W. and Cooper, M.C. (1985), An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50(2).

Rousseeuw, P.J. and Leroy, A.M. (1987), Robust Regression and Outlier Detection, John Wiley & Sons, New York.

Serbert, D.M., Montgomery, D.C. and Rollier, D.
(1998), A clustering algorithm for identifying multiple outliers in linear regression, Computational Statistics & Data Analysis, 27.


### Simple Linear Regression Model & Introduction to. OLS Estimation

Inside ECOOMICS Introduction to Econometrics Simple Linear Regression Model & Introduction to Introduction OLS Estimation We are interested in a model that explains a variable y in terms of other variables

### Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables

Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables 26.1 S 4 /IEE Application Examples: Multiple Regression An S 4 /IEE project was created to improve the 30,000-footlevel metric

### Measurement Error and Linear Regression of Astronomical Data. Brandon Kelly Penn State Summer School in Astrostatistics, June 2007

Measurement Error and Linear Regression of Astronomical Data Brandon Kelly Penn State Summer School in Astrostatistics, June 2007 Classical Regression Model Collect n data points, denote i th pair as (η

### Detection of Outliers in Regression Analysis by Information Criteria

Detection of Outliers in Regression Analysis by Information Criteria Seppo PynnÄonen, Department of Mathematics and Statistics, University of Vaasa, BOX 700, 65101 Vaasa, FINLAND, e-mail sjp@uwasa., home

### Prediction Intervals in the Presence of Outliers

Prediction Intervals in the Presence of Outliers David J. Olive Southern Illinois University July 21, 2003 Abstract This paper presents a simple procedure for computing prediction intervals when the data

### 2 Prediction and Analysis of Variance

2 Prediction and Analysis of Variance Reading: Chapters and 2 of Kennedy A Guide to Econometrics Achen, Christopher H. Interpreting and Using Regression (London: Sage, 982). Chapter 4 of Andy Field, Discovering

### The Data Cleaning Problem: Some Key Issues & Practical Approaches. Ronald K. Pearson

The Data Cleaning Problem: Some Key Issues & Practical Approaches Ronald K. Pearson Daniel Baugh Institute for Functional Genomics and Computational Biology Department of Pathology, Anatomy, and Cell Biology

### FAULT DETECTION AND ISOLATION WITH ROBUST PRINCIPAL COMPONENT ANALYSIS

Int. J. Appl. Math. Comput. Sci., 8, Vol. 8, No. 4, 49 44 DOI:.478/v6-8-38-3 FAULT DETECTION AND ISOLATION WITH ROBUST PRINCIPAL COMPONENT ANALYSIS YVON THARRAULT, GILLES MOUROT, JOSÉ RAGOT, DIDIER MAQUIN

### Sampling Distributions in Regression. Mini-Review: Inference for a Mean. For data (x 1, y 1 ),, (x n, y n ) generated with the SRM,

Department of Statistics The Wharton School University of Pennsylvania Statistics 61 Fall 3 Module 3 Inference about the SRM Mini-Review: Inference for a Mean An ideal setup for inference about a mean

### Chapter 10. Supplemental Text Material

Chater 1. Sulemental Tet Material S1-1. The Covariance Matri of the Regression Coefficients In Section 1-3 of the tetbook, we show that the least squares estimator of β in the linear regression model y=

### Chapter 9. Correlation and Regression

Chapter 9 Correlation and Regression Lesson 9-1/9-2, Part 1 Correlation Registered Florida Pleasure Crafts and Watercraft Related Manatee Deaths 100 80 60 40 20 0 1991 1993 1995 1997 1999 Year Boats in

### Robust Estimation of Cronbach s Alpha

Robust Estimation of Cronbach s Alpha A. Christmann University of Dortmund, Fachbereich Statistik, 44421 Dortmund, Germany. S. Van Aelst Ghent University (UGENT), Department of Applied Mathematics and

### Simple Linear Regression

Simple Linear Regression OI CHAPTER 7 Important Concepts Correlation (r or R) and Coefficient of determination (R 2 ) Interpreting y-intercept and slope coefficients Inference (hypothesis testing and confidence

### 9 Correlation and Regression

9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

### Regression Graphics. R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108

Regression Graphics R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Abstract This article, which is based on an Interface tutorial, presents an overview of regression

### Applications of Information Geometry to Hypothesis Testing and Signal Detection

CMCAA 2016 Applications of Information Geometry to Hypothesis Testing and Signal Detection Yongqiang Cheng National University of Defense Technology July 2016 Outline 1. Principles of Information Geometry

### Robust regression in Stata

The Stata Journal (2009) 9, Number 3, pp. 439 453 Robust regression in Stata Vincenzo Verardi 1 University of Namur (CRED) and Université Libre de Bruxelles (ECARES and CKE) Rempart de la Vierge 8, B-5000

### Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

### Quantitative Methods for Economics, Finance and Management (A86050 F86050)

Quantitative Methods for Economics, Finance and Management (A86050 F86050) Matteo Manera matteo.manera@unimib.it Marzio Galeotti marzio.galeotti@unimi.it 1 This material is taken and adapted from Guy Judge

### Detection of single influential points in OLS regression model building

Analytica Chimica Acta 439 (2001) 169 191 Tutorial Detection of single influential points in OLS regression model building Milan Meloun a,,jiří Militký b a Department of Analytical Chemistry, Faculty of

### The Bayesian Approach to Multi-equation Econometric Model Estimation

Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

### Reliability of inference (1 of 2 lectures)

Reliability of inference (1 of 2 lectures) Ragnar Nymoen University of Oslo 5 March 2013 1 / 19 This lecture (#13 and 14): I The optimality of the OLS estimators and tests depend on the assumptions of

### ASA Section on Survey Research Methods

REGRESSION-BASED STATISTICAL MATCHING: RECENT DEVELOPMENTS Chris Moriarity, Fritz Scheuren Chris Moriarity, U.S. Government Accountability Office, 411 G Street NW, Washington, DC 20548 KEY WORDS: data

### Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares

Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares R Gutierrez-Osuna Computer Science Department, Wright State University, Dayton, OH 45435,

### School of Mathematical Sciences. Question 1. Best Subsets Regression

School of Mathematical Sciences MTH5120 Statistical Modelling I Practical 9 and Assignment 8 Solutions Question 1 Best Subsets Regression Response is Crime I n W c e I P a n A E P U U l e Mallows g E P

### Author s Accepted Manuscript

Author s Accepted Manuscript Robust fitting of mixtures using the Trimmed Likelihood Estimator N. Neykov, P. Filzmoser, R. Dimova, P. Neytchev PII: S0167-9473(06)00501-9 DOI: doi:10.1016/j.csda.2006.12.024

### Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

### On Monitoring Shift in the Mean Processes with. Vector Autoregressive Residual Control Charts of. Individual Observation

Applied Mathematical Sciences, Vol. 8, 14, no. 7, 3491-3499 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/.12988/ams.14.44298 On Monitoring Shift in the Mean Processes with Vector Autoregressive Residual

### STAT Chapter 11: Regression

STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship

### Simple Linear Regression and Correlation. Chapter 12 Stat 4570/5570 Material from Devore s book (Ed 8), and Cengage

12 Simple Linear Regression and Correlation Chapter 12 Stat 4570/5570 Material from Devore s book (Ed 8), and Cengage So far, we tested whether subpopulations means are equal. What happens when we have

### A Mathematical Programming Approach for Improving the Robustness of LAD Regression

A Mathematical Programming Approach for Improving the Robustness of LAD Regression Avi Giloni Sy Syms School of Business Room 428 BH Yeshiva University 500 W 185 St New York, NY 10033 E-mail: agiloni@ymail.yu.edu

### Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD

Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD Introduction For pedagogical reasons, OLS is presented initially under strong simplifying assumptions. One of these is homoskedastic errors,

### Ch. 17. DETERMINATION OF SAMPLE SIZE

LOGO Ch. 17. DETERMINATION OF SAMPLE SIZE Dr. Werner R. Murhadi www.wernermurhadi.wordpress.com Descriptive and Inferential Statistics descriptive statistics is Statistics which summarize and describe

### Multiple Linear Regression

Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

### OSU Economics 444: Elementary Econometrics. Ch.10 Heteroskedasticity

OSU Economics 444: Elementary Econometrics Ch.0 Heteroskedasticity (Pure) heteroskedasticity is caused by the error term of a correctly speciþed equation: Var(² i )=σ 2 i, i =, 2,,n, i.e., the variance

### High-dimensional data: Exploratory data analysis

High-dimensional data: Exploratory data analysis Mark van de Wiel mark.vdwiel@vumc.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Contributions by Wessel

### Do not copy, post, or distribute

14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

### Multicollinearity and A Ridge Parameter Estimation Approach

Journal of Modern Applied Statistical Methods Volume 15 Issue Article 5 11-1-016 Multicollinearity and A Ridge Parameter Estimation Approach Ghadban Khalaf King Khalid University, albadran50@yahoo.com

### Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 29 Multivariate Linear Regression- Model

### Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

### MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017

MLMED User Guide Nicholas J. Rockwood The Ohio State University rockwood.19@osu.edu Beta Version May, 2017 MLmed is a computational macro for SPSS that simplifies the fitting of multilevel mediation and

### STATISTICAL INFERENCE FOR SURVEY DATA ANALYSIS

STATISTICAL INFERENCE FOR SURVEY DATA ANALYSIS David A Binder and Georgia R Roberts Methodology Branch, Statistics Canada, Ottawa, ON, Canada K1A 0T6 KEY WORDS: Design-based properties, Informative sampling,

### Approximating the step change point of the process fraction nonconforming using genetic algorithm to optimize the likelihood function

Journal of Industrial and Systems Engineering Vol. 7, No., pp 8-28 Autumn 204 Approximating the step change point of the process fraction nonconforming using genetic algorithm to optimize the likelihood