The Algorithm for Multiple Outliers Detection Against Masking and Swamping Effects


Int. J. Contemp. Math. Sciences, Vol. 3, 2008, no. 17

Jung-Tsung Chiang
Department of Business Administration
Ling Tung University, No. 1 Ling-Tung Rd., Taichung City, Taiwan
jungtsung@mail2000.com.tw

Abstract

Gentleman and Wilk's method for detecting $k$ outliers searches for the subgroup of $n-k$ observations with the minimum residual sum of squares (Min SSE). The method proposed here modifies that criterion to the minimum prediction error sum of squares (Min PRESS). A fast algorithm then finds the best construction data set (the one containing all good observations) from the absolute Jackknife residuals at the first stage: the entire data set is divided into two groups, a clean group used to compute the predicted function and a second group, containing the outliers, which is examined by ADPs (absolute deviations of predictions). A simulation study and two well-known examples are presented.

Keywords: OLS method, Masking effect, Swamping effect, Jackknife residuals, Cook's Distance, Data Structure

1 Introduction

The OLS method is widely used in linear models to detect outliers. Outliers in a sample reflect either (1) errors of measurement or (2) intrinsic variability (mean shift, inflation of variances, or others). In case (1) the outliers should be excluded from the sample or corrected; in case (2) further methods can be developed where possible. Several authors have studied multiple-outlier detection. Gentleman and Wilk (1975) [3] proposed the deletion method to identify a subset of $k$ outliers: observations are deleted sequentially so as to produce the largest reduction in the residual sum of squares. This is equivalent to finding the $(n-k)$-subset of the original data with the minimum residual sum of squared errors (MSSE).

A problem arises from the substantial computation, which may be infeasible for data sets with a large number of observations. Marasinghe (1985) [8] therefore introduced a new test statistic, $F_k$, for detecting multiple outliers in linear regression models; the statistic is derived from a sequential testing procedure. Paul and Fung (1991) [10] studied a generalized extreme studentized residual (GESR) procedure, which controls the type I error and uses a two-phase scheme to identify outliers with high leverage values. Hadi and Simonoff (1993) [5] offered adjusted residuals as criteria for outlier detection and introduced two algorithms for detecting multiple outliers; in both, the approximated MSSE subsets are tested for the no-outlier hypothesis with a combination of single-linkage clustering (Hartigan 1981) [6] and back-stepping (Rosner 1975 [11]; Simonoff 1984a [13]) to avoid the masking and swamping effects. More recently, Wei and Fung (1999) [16] proposed the mean-shift outlier model for general weighted regression.

In this study we give an explicit, clean definition of outliers based on a best construction data set and develop a fast algorithm, via data splitting, to identify them. A reasonable data splitting is required: the difficulty lies in establishing an effective factor space in which the least squares equation satisfies the interpolation rule. The ordinary least squares (OLS) method must also be modified, since the distribution of the outliers differs from that of the clean data. Moreover, influential points affect the OLS predicted function and may suffer from masking and swamping effects (see Chiang, 2007 [2]). Throughout the article we focus on the case of mean-shifted expected values (mean-shift models). Two well-known examples and a simulation study are presented.

2 The Formulation of the Methods for Multiple Outliers

We consider the full-rank linear model

$$Y = X\beta + \epsilon \qquad (1)$$

where $X$ is a known $n \times p$ ($n \geq p+1$) full-rank matrix, $\beta$ is an unknown $p \times 1$ vector, and $\epsilon$ is an $n \times 1$ error vector with i.i.d. components, $\epsilon \sim N(0, \sigma^2 I_n)$. The least squares residuals $e$ and their variance are

$$e = Y - \hat{Y} = (I - H)Y, \qquad \mathrm{Var}(e) = (I - H)\sigma^2,$$

where the hat matrix $H = X(X'X)^{-1}X'$ is symmetric and idempotent. Now delete the $i$th observation and use the remaining $n-1$ observations to compute the fitted value of the $i$th case, $\hat{Y}_{i(i)}$. The difference between the observed value $Y_i$ and $\hat{Y}_{i(i)}$ is called the deleted residual of the $i$th case, denoted by $d_i = Y_i - \hat{Y}_{i(i)}$.
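These diagnostics need no refitting, since the deleted residual has the standard closed form $d_i = e_i/(1-h_{ii})$. A minimal numpy sketch (the function name is illustrative, and the design matrix `X` is assumed to already contain a column of ones):

```python
import numpy as np

def ols_diagnostics(X, Y):
    """Hat matrix, OLS residuals, leverages, and deleted residuals."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)^{-1}X'
    e = Y - H @ Y                          # OLS residuals e = (I - H)Y
    h = np.diag(H)                         # leverages h_ii
    d = e / (1.0 - h)                      # deleted residuals d_i = e_i/(1 - h_ii)
    return H, e, h, d
```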

The Jackknife residual $r_i^*$ is defined as

$$r_i^* = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}, \qquad i = 1, 2, \ldots, n \qquad (2)$$

where $s_{(i)}^2 = \frac{Y_{(i)}'(I-H_{(i)})Y_{(i)}}{n-p-1}$ and $h_{ii} = X_i'(X'X)^{-1}X_i$ is the $i$th diagonal element of $H$; deletion of the $i$th observation is indicated by the index $(i)$ in brackets. If $\mathrm{rank}(X_{(i)}) = p$ and the $\epsilon$ are i.i.d. $N(0, \sigma^2 I_n)$, then the Jackknife residuals $r_i^*$, $i = 1, 2, \ldots, n$, are $t_{n-p-1}$ distributed (Beckman and Trussell, 1974 [1]).

Next, the prediction error sum of squares is defined as $PRESS = \sum_{i=1}^{n} d_i^2$, where $d_i = Y_i - \hat{Y}_{i(i)}$ is the deleted residual of the $i$th case. Then

$$PRESS = \sum_{i=1}^{n} \frac{e_i^2}{(1-h_{ii})^2} \qquad (3)$$

The PRESS treats each observation as a new one when it is deleted from the data set; it is equivalent to the cross-validation criterion CV(1) (Stone, 1973 [14]). A model with small PRESS is considered well fitted, so a modified criterion for identifying a single outlier is

$$\mathrm{Min}: PRESS_{G_i}, \quad i = 1, 2, \ldots, n \qquad (4)$$

where $G_i$ is the subgroup without the $i$th observation. For a large sample, PRESS and SSE are asymptotically equivalent (McQuarrie, 1999 [9]); criterion (4) is then also equivalent to Grubbs' test (Grubbs, 1950 [4]).

Here we consider a data set $M$ of size $n$ containing $k$ outliers, and the entirely clean data set $C$ with $n_c = n-k$ good observations is a subset of $M$. Suppose all good observations of the subset $C$ come from the target distribution of the linear model, i.e., $Y_c \sim N(X_c\beta, \sigma^2 I_{n_c})$. The OLS estimator $\hat{\beta}_c = (X_c'X_c)^{-1}X_c'Y_c$ is invariant if $k$ new observations $(X_{n_c+1}, \hat{Y}_{n_c+1}), \ldots, (X_{n_c+k}, \hat{Y}_{n_c+k})$ are added to the data set $C$. That is,

$$\hat{\beta}_c = (X'X)^{-1}X'\tilde{Y} \qquad (5)$$

where $X$ is an $n \times p$ matrix, $\tilde{Y} = (Y_1, \ldots, Y_{n_c}, \hat{Y}_{n_c+1}, \ldots, \hat{Y}_{n_c+k})'$, and $\hat{Y}_{n_c+j} = X_{n_c+j}'\hat{\beta}_c$, $j = 1, \ldots, k$. The new data set $B$ of $(X, \tilde{Y})$ is associated with the residuals $(e_1, e_2, \ldots, e_{n_c}, 0, \ldots, 0)'$. Next, let the original data set $M$ be expressed as

$$\{(X_1, Y_1), \ldots, (X_{n_c}, Y_{n_c}), (X_{n_c+1}, \hat{Y}_{n_c+1} + d_{n_c+1}), \ldots, (X_{n_c+k}, \hat{Y}_{n_c+k} + d_{n_c+k})\}$$

where $d_j > 2\sigma Z_{1-\alpha/2}$, $j = n_c+1, \ldots, n_c+k$, and the last $k$ observations of $M$ are the outliers. Obviously $Y_M = \tilde{Y} + \Delta Y$, with $\Delta Y = (0, \ldots, 0, d_{n_c+1}, \ldots, d_{n_c+k})'$.
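Before examining the effect of $\Delta Y$, note that the quantities in (2)-(4) also have leave-one-out closed forms ($SSE_{(i)} = SSE - e_i^2/(1-h_{ii})$), so none of them require $n$ refits. A minimal sketch, with illustrative function names:

```python
import numpy as np

def jackknife_residuals(X, Y):
    """Jackknife residuals r_i* of (2), via the usual leave-one-out closed forms."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = Y - H @ Y
    h = np.diag(H)
    # s_(i)^2 = (SSE - e_i^2/(1 - h_ii)) / (n - p - 1): variance with case i removed
    s2_i = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_i * (1.0 - h))

def press(X, Y):
    """PRESS of (3): sum of e_i^2 / (1 - h_ii)^2."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = Y - H @ Y
    h = np.diag(H)
    return float(np.sum((e / (1.0 - h)) ** 2))

def single_outlier_by_press(X, Y):
    """Criterion (4): the case i whose deletion gives the smallest PRESS_{G_i}."""
    scores = [press(np.delete(X, i, axis=0), np.delete(Y, i)) for i in range(len(Y))]
    return int(np.argmin(scores))
```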

The OLS residuals of the original data set $M$ are then

$$\tilde{e} = (I-H)Y_M = (I-H)(\tilde{Y} + \Delta Y) = (I-H)\tilde{Y} + (I-H)\Delta Y$$
$$= (e_1, \ldots, e_{n_c}, 0, \ldots, 0)' + \Delta Y - H\Delta Y = (e_1, \ldots, e_{n_c}, d_{n_c+1}, \ldots, d_{n_c+k})' - H\Delta Y = e - H\Delta Y$$

where $H$ is the $n \times n$ hat matrix and $e$ is the $n \times 1$ vector computed from the clean data set $C$. Consequently, the studentized residuals $\tilde{r}_i$ corresponding to $\tilde{e}_i$ are

$$\tilde{r}_i = \frac{e_i - \sum_{j=n_c+1}^{n} h_{ij}d_j}{s\sqrt{1-h_{ii}}}, \quad i = 1, 2, \ldots, n_c \qquad (6)$$

and

$$\tilde{r}_i = \frac{d_i - \sum_{j=n_c+1}^{n} h_{ij}d_j}{s\sqrt{1-h_{ii}}}, \quad i = n_c+1, \ldots, n_c+k \qquad (7)$$

These show that (1) for $i = 1, 2, \ldots, n_c$, some $\tilde{r}_i$ may exceed the critical points, so the swamping effect appears on those cases: they are incorrectly regarded as outliers; and (2) for $i = n_c+1, \ldots, n_c+k$, some $\tilde{r}_i$ may fall below the critical points, so the masking effect appears on those cases: they are incorrectly regarded as good observations (inliers). As argued above, the masking and swamping effects depend on the locations $h_{ii}$, the correlations $h_{ij}$, the signs of the $d_i$, and the permutations of the outliers.

Basically, the $k$ outliers can be regarded as a perturbation of the clean data set $C$. In the linear model $Y = X\beta + \epsilon \sim N(X\beta, \sigma^2)$, suppose the true function $Y = X\beta$ is known. Then the entire subset of $k$ outliers satisfies

$$\mathrm{Max}: \sum_{i \in I_j} \epsilon_i^2 \qquad (8)$$

where $I_j$ ranges over the subsets of size $k$ of the sample. This is equivalent to

$$\mathrm{Min}: \sum_{i \notin I_j} \epsilon_i^2 \qquad (9)$$

If the true function is known, only one entire subset of $k$ outliers in a sample satisfies this definition. It may happen, however, that two deleted subsets, say $I_1$ and $I_2$, give $SSE_{(I_1)} = SSE_{(I_2)} = \min_i SSE_{(I_i)}$; a good choice between $I_1$ and $I_2$ as the entire subset of $k$ outliers is then the one with the smaller $\mathrm{Tr}(X'X)^{-1}$, which corresponds to a shorter confidence interval for $\hat{\beta}_c$.
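To see the masking mechanism of (6)-(7) concretely, the following small simulation plants two same-side outliers at remote design points, so that the cross terms $\sum_j h_{ij}d_j$ are large. This sketch reuses `jackknife_residuals` from above; all settings (seed, sample size, shift of 5) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean simple-regression data plus two planted outliers at the two most
# remote x values, so their leverages h_ii and cross terms h_ij are large
# and the -sum h_ij d_j term in (7) can pull their residuals down.
n = 30
x = np.sort(rng.uniform(0.0, 10.0, n))
x[-2:] = [19.0, 20.0]                 # two high-leverage design points
X = np.column_stack([np.ones(n), x])
Y = 1.0 + x + rng.normal(size=n)
Y[-2:] += 5.0                         # mean shifts d_i = 5 on both cases

r = jackknife_residuals(X, Y)         # sketch from earlier in this section
print(np.round(r[-2:], 2))            # per (7), these may sit below the cutoff
```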

An ideal criterion for the best clean data set is to minimize the predictive sum of squared errors:

$$\mathrm{Min}: PRESS_{G_i}, \quad i = 1, 2, \ldots, C(n, n-k) \qquad (10)$$

where $G_i$ is an arbitrary subgroup of $n-k$ observations. Now, if the outliers come from mean shifts of $k$ observations, the mean-shift model is

$$Y = X\beta + D + \epsilon \qquad (11)$$

where $D = (0, \ldots, 0, d_{n-k+1}, \ldots, d_n)'$ and $d_i > 2Z_{1-\alpha/2}\sigma$. The $\{(X_i, Y_i)\}_{i=n-k+1}^{n}$ are the $k$ outliers with mean shifts $\{d_i\}_{i=n-k+1}^{n}$. The best clean data set, denoted by $C$, is $\{(X_i, Y_i)\}_{i=1}^{n_c}$, where $n_c = n-k$ or less. The predictive confidence interval $I_p$ computed from the clean data set $C$ is

$$X_i'\hat{\beta}_c \pm C_{\alpha}\, s_c \sqrt{1 + X_i'(X_c'X_c)^{-1}X_i} \qquad (12)$$

where $s_c = \sqrt{MSE_c}$ and $X = (X_c', X_i)'$. Since $X_i'(X_c'X_c)^{-1}X_i = \frac{h_{ii}}{1-h_{ii}}$ (see p. 127 of Applied Linear Regression by Weisberg, 1985 [15]), the interval $I_p$ can be written as

$$X_i'\hat{\beta}_c \pm \frac{C_{\alpha}\, s_c}{\sqrt{1-h_{ii}}} \qquad (13)$$

where the approximate critical points $C_{\alpha} = t(1-\alpha/(2n_c),\, n_c-p-1)$ follow from the upper bound of the Bonferroni inequality and the $t$ distribution of the Jackknife residuals. If needed, an adjusted critical value $t(1-\frac{SSE}{PRESS}\,\alpha/(2n_c),\, n_c-p-1)$ may be used instead, depending on the data structure. The subset $I$ of $k$ outliers is then obtained by

$$Y_i \notin I_p, \quad i = 1, \ldots, n \qquad (14)$$

Here only $i = n-k+1, \ldots, n$ will be correctly identified as outlying cases. Moreover, each $C \cup \{i\}$, $i \in I$, forms a new data set with a single outlier, whose absolute Jackknife residual $|r_i^*|$ is greater than the critical value $C_{\alpha}$. Note that the size $n_c = n-k$ of the ideal subset $C$ of good observations may be reduced to $n-k-1$, $n-k-2$, or $n-k-3$; this depends on the researcher's judgment of the residual plots and on $\sum_{i=1}^{n_c} \mathrm{sign}(e_i) \approx 0$ for a large sample.
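Interval (12) with the Bonferroni critical point is simple to compute; here is a minimal sketch (function name illustrative; `scipy.stats.t.ppf` supplies the $t$ quantile, and `X_new` holds the rows to be screened by (14)):

```python
import numpy as np
from scipy import stats

def prediction_interval(Xc, Yc, X_new, alpha=0.05):
    """Interval (12)-(13) from the clean set C, with the Bonferroni critical
    point C_alpha = t(1 - alpha/(2 n_c), n_c - p - 1)."""
    nc, p = Xc.shape
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    beta_c = XtX_inv @ Xc.T @ Yc
    resid = Yc - Xc @ beta_c
    s_c = np.sqrt(resid @ resid / (nc - p))          # s_c = sqrt(MSE_c)
    C_alpha = stats.t.ppf(1.0 - alpha / (2.0 * nc), nc - p - 1)
    half = C_alpha * s_c * np.sqrt(1.0 + np.sum((X_new @ XtX_inv) * X_new, axis=1))
    center = X_new @ beta_c
    return center - half, center + half              # Y_i outside => flag by (14)
```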

3 The Algorithm, Famous Examples, and Simulation Studies

An approach to outlier detection in linear models is available through data splitting. If an entire data set of size $n$ contains $k$ outliers, then all $C(n, n-k)$ partitions of the data are candidate construction data sets $G_i$, $i = 1, 2, \ldots, C(n, n-k)$, and the best construction data set with $n-k$ clean observations is the one achieving

$$\mathrm{Min}: PRESS_{G_i}, \quad i = 1, 2, \ldots, C(n, n-k)$$

However, this search involves a great deal of computation, so a fast data-splitting algorithm is proposed below.

k = 1 (one outlier). A single outlier is easy to identify from the maximum absolute Jackknife residual; the leave-one-out procedure is the best way here. In practice $k$ is unknown, so we instead use data splitting to find the subset of the $k$ most likely outliers, with the Jackknife residuals $r_i^*$ and Cook's distances $D_i$ used to pick the construction data set at the first stage.

Algorithm (k unknown):

Step 1: Use all observations to calculate the Jackknife residual $r_i^*$, $i = 1, \ldots, n$, of every observation.

Step 2: Arrange the absolute values $|r_i^*|$ in ascending order, $|r^*|_{(1)} \leq |r^*|_{(2)} \leq \cdots \leq |r^*|_{(n)}$, where $|r^*|_{(1)} = \min_i |r_i^*|$ and $|r^*|_{(n)} = \max_i |r_i^*|$.

Step 3: Choose the observations $\{|r^*|_{(i)}\}_{i=1}^{[qn]}$ ($0.8 \leq q \leq 0.95$) as a construction data set $C_1$; the remaining observations form a validation data set $V_1$.

Step 4: Confirm that $C_1$ is clean from the plot of its Jackknife residuals: the maximum absolute Jackknife residual of $C_1$ should be less than the critical point $C_{\alpha}$.

Step 5: Calculate the ADP (absolute deviation of prediction) for the $n_v$ validation observations, $|Y_{i,V_1} - \hat{Y}_{i,C_1}|$, $i = 1, 2, \ldots, n_v$. Then move the observations of $V_1$ with small ADPs (try ADP $< C_{\alpha}\sqrt{MSE_c}$) into $C_1$ to obtain a new construction data set $C_2$; the remaining observations form $V_2$, and the new predicted values $\hat{Y}_{i,C_2}$ are computed from $C_2$.

Step 6: In the same way, move the observations with small ADPs into $C_2$ to get $C_3$, and so on. Steps 4 and 5 are repeated until the best construction data set of size $n-k$ is obtained; the remaining $k$ observations are outliers, since they do not belong to $I_p$.

If Cook's distance $D_i$ replaces $|r^*|_{(i)}$ in Steps 1-3, the outcome may differ slightly.
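A compact sketch of Steps 1-6 under the stated assumptions follows. It reuses `jackknife_residuals` from Section 2; the function name and defaults are illustrative, and the visual check of Step 4 is left to the analyst:

```python
import numpy as np
from scipy import stats

def split_algorithm(X, Y, q=0.9, alpha=0.05):
    """Data-splitting sketch of Steps 1-6; returns (construction set, outliers)."""
    n, p = X.shape
    r_abs = np.abs(jackknife_residuals(X, Y))        # Step 1
    order = np.argsort(r_abs)                        # Step 2: ascending |r_i*|
    cut = int(q * n)
    C, V = list(order[:cut]), list(order[cut:])      # Step 3: C_1 and V_1
    while V:                                         # Steps 5-6
        Xc, Yc = X[C], Y[C]
        beta = np.linalg.lstsq(Xc, Yc, rcond=None)[0]
        mse = float(np.sum((Yc - Xc @ beta) ** 2)) / (len(C) - p)
        c_alpha = stats.t.ppf(1.0 - alpha / (2.0 * len(C)), len(C) - p - 1)
        adp = np.abs(Y[V] - X[V] @ beta)             # |Y_{i,V} - Yhat_{i,C}|
        back = [v for v, a in zip(V, adp) if a < c_alpha * np.sqrt(mse)]
        if not back:                                 # nothing moves: V are outliers
            break
        C += back
        V = [v for v in V if v not in back]
    return sorted(C), sorted(V)
```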

The relation between $D_i$ and $r_i^*$ is

$$D_i = \frac{(n-p)\,h_{ii}\,(r_i^*)^2}{p\,(1-h_{ii})\,(n-p-1+(r_i^*)^2)} \qquad (15)$$

which follows from the definition of Cook's distance and the relation between $r_i$ and $r_i^*$. Some of these points are elaborated in the following subsections.

3.1 Two Famous Examples

(1) Hawkins, Bradu and Kass's constructed data. The data set is reproduced from the above authors (see Table 1) [7], and Tables 2 and 3 give the Jackknife residuals and the related statistics (Figures 1 to 3) used to find the best construction subset. The data contain outliers at cases 1-10, but these go undetected owing to the masking effect: their Jackknife residuals, starting from 1.18, all fall below the critical point. The good observations 11 and 12 are incorrectly identified as outliers owing to the swamping effect; their Jackknife residuals are 4.03 and 5.29, respectively. OLS methods applied to the entire data set fail to identify the outliers, with the exception of Hadi's algorithm. We propose the following fast algorithm to find the outliers:

Step 1: Use the original data $C = \{1, 2, \ldots, 75\}$ to calculate the Jackknife residuals of all observations.

Step 2: Delete the subset $V_1 = \{1, 2, \ldots, 14, 44\}$, whose absolute Jackknife residuals rank in the top fifteen. The new construction data set $C_1 = C - V_1$ is clean according to the plot of its Jackknife residuals, and the corresponding predicted function $\hat{Y}$ in $X1$, $X2$, $X3$ is fitted from $C_1$.

Step 3: Compute the ADPs of $V_1$, i.e., $\{|Y_{i,V_1} - \hat{Y}_{i,C_1}|\}$, and move the cases 11-14 and 44 back into $C_1$, since their ADPs are less than $C_{\alpha}\sqrt{MSE_{c_1}}$. The construction data set $C_2 = C_1 \cup \{11\text{-}14, 44\}$ then yields the new predicted function.

Step 4: $C_2$ is a good clean data set according to the plot of residuals (see Figure 4).

Step 5: Examining the data sets $C_3, C_4, \ldots, C_{10}$, each contains exactly one outlier according to the Jackknife residuals. It is worth noting that observations 1-10 are significantly outlying, with $|r^*|$ values much greater than the critical point $C_{\alpha}$. We therefore declare observations 1-10 to be the outliers in the original data set. We analyzed the Hadi and Simonoff data set in the same way, with satisfactory results.
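Relation (15) also lets Cook's distance be computed directly from the Jackknife residual and the leverage when comparing the two rankings of Steps 1-3, as in Table 2. A one-line helper (name illustrative; works elementwise on numpy arrays):

```python
def cooks_from_jackknife(r_star, h, n, p):
    """Relation (15): Cook's distance D_i from the Jackknife residual r_i*
    and the leverage h_ii."""
    return ((n - p) * h * r_star**2) / (p * (1.0 - h) * (n - p - 1 + r_star**2))
```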

(2) Hadi and Simonoff's artificial data. The data set was created by Hadi and Simonoff (1993) [5] (see Table 4). They picked two predictors $X1$ and $X2$ distributed as uniform$(0, 15)$ and set the regression function $Y = X1 + X2 + \epsilon$, where $\epsilon \sim N(0, 1)$; then observations 1-3 had the quantity 4 added, giving the mean-shift model $Y = X1 + X2 + 4 + \epsilon$ for those cases. The outliers 1-3 go unidentified because of the masking effect (see Table 5), so the plain OLS method is at a disadvantage for outlier detection here, and alternative methods such as LMS and LTS (Rousseeuw, 1984) [12] also fail to identify them. The modified method of Hadi and Simonoff detects the three outliers successfully, but it is somewhat involved. In this example our procedure yields a well-fitted regression function in $X1$ and $X2$ based on a clean data set and identifies observations 1, 2, 3 as the outliers. The masking effect appears in the entire data set because the three outliers 1, 2, 3 lie on the same side and their leverage values are very close together; Section 2 discusses this situation in detail, and the mean-shift value of 4 corresponds to it.

3.2 Simulation Studies for Samples with Two and Three Outliers

Example 1: $Y = 1 + X + \epsilon$, $\epsilon \sim N(0, 1)$. A data set of size 51 is generated from the linear model $Y = 1 + X + \epsilon$, where $X_i \sim U(0, 10)$ and $\epsilon \sim N(0, 1)$. Two outliers are planted at observations 1 and 20 by adding the quantity 5 to the null data set (see Table 6).

Example 2: $Y = 2 + X1 + X2 + \epsilon$, $\epsilon \sim N(0, 1)$. A data set of size 60 is generated from the linear model $Y = 2 + X1 + X2 + \epsilon$, where $X_{1i} \sim U(0, 10)$, $X_{2i} \sim U(0, 15)$ and $\epsilon \sim N(0, 1)$. Three outliers are planted at observations 1, 2, 3 by adding the quantity 4.5 to the null data set (see Table 7).

In both simulation studies the procedure again gives satisfactory outlier detection.
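The simulations (originally generated in R) are easy to reproduce; a sketch of the Example 1 setup, feeding the `split_algorithm` sketch from above (seed and variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2008)

# Example 1 setup: Y = 1 + X + eps with X ~ U(0, 10), n = 51, and outliers
# planted at observations 1 and 20 by adding 5 (0-based indices 0 and 19).
n = 51
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + x + rng.normal(size=n)
Y[[0, 19]] += 5.0

C, outliers = split_algorithm(X, Y)   # the sketch from Section 3
print(outliers)                       # ideally recovers cases {0, 19}
```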

4 Conclusions

The OLS method is widely used in linear regression models, but masking and swamping effects remain unavoidable when the data contain a few high-leverage points. In general, the data structure provides much of the information on this topic, and the Jackknife residuals and Cook's distance are good indicators. In Section 2 we showed that the locations of the outliers, the signs of their residuals, and the permutations of the outliers are the main factors behind masking and swamping. Gentleman and Wilk's criterion $\mathrm{Min}: SSE_{G_i}$ for multiple outliers can be modified to $\mathrm{Min}: PRESS_{G_i}$. The fast algorithm developed here for finding multiple outliers is based on data splitting and Jackknife residuals, and the two well-known examples in Section 3 illustrate that it is much simpler than GESR, the multistage procedure, and Hadi's algorithms. The diagnosis of a single outlier in linear models thus extends to multiple-outlier detection, so that the masking and swamping effects are no longer a problem.

Acknowledgements. The author would like to thank his advisor Dr. Kenny Ye, Prof. Nancy Mendell, and Prof. Hongshik Ahn for their assistance in revising the paper. This paper is part of his doctoral dissertation (August 2002, AMS Dept., SUNY-Stony Brook, USA).

References

[1] R.J. Beckman and H.J. Trussell, The distribution of an arbitrary studentized residual and the effects of updating in multiple regression, J. Amer. Statist. Assoc., 69 (1974).

[2] J.T. Chiang, The masking and swamping effects using the planted mean-shift outliers models, Int. J. Contemp. Math. Sciences, 2 (2007), no. 7, 297-307.

[3] J.F. Gentleman and M.B. Wilk, Detecting outliers II: Supplementing the direct analysis of residuals, Biometrics, 31 (1975a).

[4] F.E. Grubbs, Sample criteria for testing outlying observations, Annals of Mathematical Statistics, 21 (1950).

[5] A.S. Hadi and J.S. Simonoff, Procedures for the identification of multiple outliers in linear models, Journal of the American Statistical Association, 88 (1993), no. 424.

[6] J.A. Hartigan, Consistency of single linkage for high-density clusters, Journal of the American Statistical Association, 76 (1981).

[7] D.M. Hawkins, D. Bradu and G.V. Kass, Location of several outliers in multiple-regression data using elemental sets, Technometrics, 26 (1984).

[8] M.G. Marasinghe, A multistage procedure for detecting several outliers in linear regression, Technometrics, 27 (1985).

[9] A.D.R. McQuarrie and C.-L. Tsai, Regression and Time Series Model Selection, World Scientific Publishing Co. Pte. Ltd., 1999.

[10] S.R. Paul and Y. Fung, A generalized extreme studentized residual multiple-outlier-detection procedure in linear regression, Technometrics, 33 (1991).

[11] B. Rosner, On the detection of many outliers, Technometrics, 17 (1975).

[12] P.J. Rousseeuw, Least median of squares regression, Journal of the American Statistical Association, 79 (1984).

[13] J.S. Simonoff, The calculation of outlier detection statistics, Communications in Statistics, Part B - Simulation and Computation, 13 (1984a).

[14] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Ser. B, 36 (1973).

[15] S. Weisberg, Applied Linear Regression, John Wiley and Sons, Inc., 1985.

[16] W.H. Wei and W.K. Fung, The mean-shift outlier model in general weighted regression and its applications, Computational Statistics and Data Analysis, 30 (1999).

Table 1: Hawkins, Bradu and Kass's constructed data (columns: Obs, Y, X1, X2, X3, for the 75 observations; the numeric entries were lost in extraction and are not reproduced here).

Table 2: Cook's distance $D_i$ and related statistics for the Hawkins data (columns: Obs, $D_i$, $r^*$, $h_{ii}$, CVR, DFFITS; numeric entries not reproduced here).
*The observations with the top fifteen largest $D_i$ are 1-14 and 43, slightly different from the top fifteen largest $|r_i^*|$, which are 1-14 and 44. These fifteen cases are deleted at the first stage.

Table 3: Obtaining the best construction data set step by step on the Hawkins et al. data set (columns: $V_i$, observations in $V_i$, Max $|r^*|$ of $C_i$, construction data set $C_i = C - V_i$, for $i = 1, \ldots, 13$; numeric entries not reproduced here).
*In the third column, the number in parentheses indicates the observation attaining the maximum. The best construction data set, of size 65, is $C_2 = \{11, 12, \ldots, 75\}$.

Table 4: Hadi and Simonoff's artificial data (columns: Obs, Y, X1, X2; numeric entries not reproduced here).

Table 5: Deleting subsets based on the Jackknife residuals $r^*$ (rows: Obs, $r^*$ values at each deletion stage, and an outlier indicator, Yes for Obs 1-3 and No for all others; numeric entries not reproduced here).
*The best construction data set is $C - \{1, 2, 3\}$, and observations 1, 2, 3 are the outliers, computed from the Hadi and Simonoff data set.

Table 6: The simulation data set with two outliers (columns: Obs, Y, X; numeric entries not reproduced here). The sample of size 51 was generated by the R project, version R 1.4.1.

Table 7: The simulation data set with three outliers (columns: Obs, Y, X1, X2; numeric entries not reproduced here). The sample of size 60 was generated by the R project, version R 1.4.1.

Figure 1: The scatter plot of Jackknife residuals of the Hawkins et al. data set shows that the linear model is not well fitted for the entire set.
*Several outliers create perturbation effects on this data set.

Figure 2: Index plot of the leverage measure (diagonal of the hat matrix) for the Hawkins et al. data set.
*The horizontal line is at the mean of the $h$ values, and the segments function joins pairs of points by a line. Observation 14 is a high-leverage point.

Figure 3: Scatter plot of absolute Jackknife residuals vs. Cook's distance for the Hawkins data, highlighting the observations with larger Cook's distance.
*Note that $D_i = \frac{(n-p)\,h_{ii}\,(r_i^*)^2}{p\,(1-h_{ii})\,(n-p-1+(r_i^*)^2)}$ is not a one-to-one correspondence when mapping $(h_{ii}, r_i^*)$ to $D_i$.

Figure 4: The scatter plot of residuals shows that the linear model is well fitted for the subset $C_2$.

Received: December 9, 2007
