A CONNECTION BETWEEN LOCAL AND DELETION INFLUENCE


Sankhyā: The Indian Journal of Statistics, 2000, Volume 62, Series A, Pt. 1.

By M. MERCEDES SUÁREZ RANCEL and MIGUEL A. GONZÁLEZ SIERRA, University of La Laguna, Spain

SUMMARY. In this paper we show that Hadi's (1992) measure of the overall potential influence of a subset of observations is closely related to a variant of Cook's (1986) absolute measure of local influence. This result indicates that the structure of the local influence concept is quite useful for identifying influential subsets, and it provides a further justification for local influence analysis.

1. Introduction

The method of local influence was introduced by Cook (1986) and modified by Billor and Loynes (1993) as a general tool for assessing the effect of local departures from the assumptions underlying a statistical model. The work of several authors (Lawrance, 1988; Peña and Yohai, 1995) indicates that one attraction of the local influence concept is that it assesses the effect of joint perturbations of the data cases more easily than global influence measures do. Consequently, local influence results are frequently free of the masking effects that cause difficulties for individual case-deletion methods.

This article shows that local influence analysis of perturbations of the variance is similar to the usual regression diagnostic based on Hadi's (1992) measure for detecting influential subsets. Section 2 reviews the general idea of local influence. Section 3 describes Hadi's measure. Section 4 establishes the connection between the two approaches, and Section 5 provides illustrative examples.

Paper received March 1998; revised January.
AMS (1991) subject classification. 62J20.
Key words and phrases. Case deletion, Hadi's measure, local influence, masking, regression, swamping.

2. Local Influence

Consider the standard linear regression model

$$ Y = X\beta + \varepsilon, \qquad (1) $$

where $\varepsilon$ is an $n \times 1$ vector whose elements are assumed to be independent normal random variables with mean zero and known variance $\sigma^2$, $X$ is a known $n \times k$ matrix of full column rank, $\beta$ is a $k \times 1$ vector of parameters and $Y$ is an $n \times 1$ vector of responses. The $i$-th observation $y_i$ on the response, together with the associated values of the explanatory variables, is referred to as the $i$-th case.

Many measures have been suggested for assessing the influence of observations in regression modelling; Chatterjee and Hadi (1986) give an excellent review of the subject. Cook (1986) considered a general version of Cook's distance,

$$ D_i = \frac{\|\hat Y - \hat Y_{(i)}\|^2}{k\sigma^2}, \qquad (2) $$

where $\hat Y$ and $\hat Y_{(i)}$ are the $n \times 1$ vectors of fitted values based on the full data and on the data without the $i$-th case, respectively, and $k$ is the dimension of $\beta$. He investigated

$$ D_i(w) = \frac{\|\hat Y - \hat Y(w)\|^2}{k\sigma^2}, \qquad (3) $$

where $\hat Y(w)$ is the vector of fitted values obtained when the $i$-th case has weight $w$ and the remaining cases have weight 1. These ideas have been extended to general models. The extension is partially motivated by the following relationship between $D_i(w)$ and the log-likelihood $L(\beta)$ for model (1):

$$ k D_i(w) = \frac{\|Y - \hat Y(w)\|^2 - \|Y - \hat Y\|^2}{\sigma^2} = 2\,[\,L(\hat\beta) - L(\hat\beta_w)\,], $$

where $\hat\beta = \hat\beta_{w=1}$ and $\hat\beta_w$ is the maximum likelihood estimator of $\beta$ when the $i$-th case has weight $w$. The form of this relationship is a consequence of the statistical structure assumed for the errors in model (1).

Denote the log-likelihoods of the unperturbed and perturbed models by $L(\theta)$ and $L(\theta \mid w)$, respectively. The likelihood displacement $LD(w)$ is then defined by

$$ LD(w) = 2\,[\,L(\hat\theta) - L(\hat\theta_w)\,], \qquad (4) $$

where $\hat\theta$ and $\hat\theta_w$ are the maximum likelihood estimators of $\theta$ under the unperturbed and perturbed models, respectively. The values of $w$ together with $LD(w)$ form the surface of interest as $w$ varies over a suitable space.
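Before turning to the curvature diagnostics, the following minimal numerical sketch makes the case-weight quantities above concrete. It computes Cook's distance (2) for every case through the standard closed form $D_i = e_i^2 p_{ii} / (k\sigma^2 (1-p_{ii})^2)$, which equals $\|\hat Y - \hat Y_{(i)}\|^2/(k\sigma^2)$ without refitting the model $n$ times. The function and variable names are ours, not the paper's, and $\sigma^2$ is estimated by the residual mean square when it is not supplied (the paper treats it as known).

```python
import numpy as np

def cooks_distance(X, y, sigma2=None):
    """Cook's distance D_i = ||Yhat - Yhat_(i)||^2 / (k * sigma2) for each case.

    Uses the closed form e_i^2 * p_ii / (k * sigma2 * (1 - p_ii)^2), which avoids
    deleting and refitting each case. If sigma2 is None it is estimated by the
    residual mean square."""
    n, k = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix P = X (X'X)^{-1} X'
    p = np.diag(P)                             # leverages p_ii
    e = y - P @ y                              # ordinary residuals
    if sigma2 is None:
        sigma2 = e @ e / (n - k)
    return e**2 * p / (k * sigma2 * (1 - p)**2)

# Tiny illustration on simulated data (values are arbitrary):
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)
print(cooks_distance(X, y))
```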

The direction $h_{\max}$ of maximum curvature of the likelihood displacement surface at the postulated model (where $w = w_0$) indicates the greatest local sensitivity to perturbations, and it is used as the main diagnostic tool in the local influence method.

Billor and Loynes (1993) point out some practical and theoretical difficulties with Cook's approach: computation of the maximum curvature is essentially restricted to the linear regression model; the curvature is not invariant under reparametrisation of the perturbation scheme; and the parameters lack a clear definition. To avoid these difficulties, Billor and Loynes (1993) suggest an alternative likelihood displacement,

$$ LD^{*}(w) = 2\,[\,L(\hat\theta) - L(\hat\theta_w \mid w)\,], \qquad (5) $$

where $L(\hat\theta_w \mid w)$ is the log-likelihood of the perturbed model, whereas Cook (1986) uses the perturbation only in the estimation of the parameters. Billor and Loynes (1993) argue that the first derivative of $LD^{*}$ provides valuable information about its local behaviour, and so they use the direction producing the maximum rate of increase of $LD^{*}$, with slope

$$ l_{\max} = \nabla LD^{*}(w_0) = -2\,\nabla L(\hat\theta \mid w). $$

If we take the perturbed model

$$ Y = X\beta + \varepsilon, \qquad (1a) $$

where $\mathrm{var}(\varepsilon) = \sigma^2 W^{-1}$ with $W = \mathrm{diag}(1, \ldots, 1, 1 + w_i, 1, \ldots, 1)$, then

$$ l_i = l_{\max, i} = 1 - \frac{e_i^2}{\sigma^2}. \qquad (6) $$

3. Hadi's Influence Measure

Hadi (1992) proposes a measure for detecting influential subsets of observations which is resistant to masking and swamping effects. The measure is based on the simple fact that potentially influential observations are outliers in the X-space, the Y-space, or both, which yields the overall influence measure

$$ H_i^2 = \frac{k}{1 - p_{ii}}\,\frac{d_i^2}{1 - d_i^2} + \frac{p_{ii}}{1 - p_{ii}}, \qquad i = 1, 2, \ldots, n, \qquad (7) $$

where $d_i^2 = e_i^2 / e'e$ is the square of the $i$-th normalized residual and $p_{ii}$ is the $i$-th diagonal element of $P = X(X'X)^{-1}X'$. The diagnostic measure (7) is the sum of two components, each of which has a nice interpretation: a large value of the first term on the right-hand side of (7) indicates that the model fits the case poorly (a large prediction error), and a large value of the second term indicates an outlier in the X-space.
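Both diagnostics introduced so far are simple functions of the ordinary residuals and the leverages, so they can be computed from a single fit. The sketch below, with function names of our own choosing, evaluates the Billor and Loynes slope (6) and Hadi's measure (7); it assumes only numpy and, as in the previous sketch, estimates $\sigma^2$ when it is not supplied.

```python
import numpy as np

def regression_quantities(X, y):
    """Leverages p_ii, ordinary residuals e_i and squared normalized residuals d_i^2."""
    P = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
    p = np.diag(P)
    e = y - P @ y
    d2 = e**2 / (e @ e)
    return p, e, d2

def billor_loynes_slope(X, y, sigma2=None):
    """Slope (6): l_i = 1 - e_i^2 / sigma^2 under the variance perturbation (1a)."""
    n, k = X.shape
    _, e, _ = regression_quantities(X, y)
    if sigma2 is None:
        sigma2 = e @ e / (n - k)               # estimated when sigma^2 is unknown
    return 1.0 - e**2 / sigma2

def hadi_influence(X, y):
    """Hadi's measure (7): H_i^2 = k d_i^2 / ((1-p_ii)(1-d_i^2)) + p_ii / (1-p_ii)."""
    n, k = X.shape
    p, _, d2 = regression_quantities(X, y)
    return k * d2 / ((1 - p) * (1 - d2)) + p / (1 - p)
```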

4. A Connection Between Local and Deletion Influence

In this section we show that local influence analysis of perturbations of the variance is similar to the usual regression diagnostic based on Hadi's measure for detecting influential subsets. To see this, we propose a new measure based on the following likelihood displacement:

$$ LD^{*}_{(i)}(w) = 2\,[\,L(\hat\theta) - L_{(i)}(\hat\theta_{w_i} \mid w_i)\,], $$

where $L_{(i)}(\hat\theta_{w_i} \mid w_i)$ is the log-likelihood under the perturbed model when the $i$-th observation is deleted. Applying this likelihood displacement to the perturbed model (1a) gives

$$ L_{(i)}(\hat\theta_{w_i} \mid w_i) = -\tfrac{1}{2}\ln 2\pi - \tfrac{1}{2}\ln[\hat\sigma^2_{(i)}] - \frac{w_i}{2\hat\sigma^2_{(i)}}\,\frac{(y_i - x_i'\hat\beta_{(i)})^2}{(1 + w_i)^2}. $$

The slope of the maximum rate of increase of $LD^{*}_{(i)}(w_i)$ is then

$$ l_{i(i)} = l_{\max(i)} = 1 - \frac{e_{i(i)}^2}{e_{(i)}'e_{(i)}} = 1 - \frac{d_i^2}{(1 - p_{ii})(1 - d_i^2)(n - k - 1)}, \qquad (8) $$

where $d_i^2 = e_i^2 / e'e$. To control the influence of high-leverage observations in (8), we propose a "quasi likelihood displacement",

$$ LD^{**}_{(i)}(w_i) = 2\,[\,L(\hat\theta) - L_{(i)}(\hat\theta_{w_i} \mid w_i)\,] + [\,\mathrm{var}(\hat y_i) - \mathrm{var}(\hat y_{w_i})\,], $$

so that the slope of the direction of maximum increase of $LD^{**}_{(i)}(w_i)$ is

$$ l_{i(i)} = l_{\max, i(i)} = 1 - \frac{d_i^2}{(1 - p_{ii})(1 - d_i^2)(n - k - 1)} - \frac{\hat\sigma^2\, p_{ii}}{(1 - p_{ii})^2}. $$

Since $\sum p_{ii} = k$ while $\sum d_i^2 = 1$, multiplying the second term by $k\hat\sigma^2$ prevents it from being dominated by the third term, giving

$$ l_{i(i)} = l_{\max, i(i)} = 1 - e'e\left[\frac{k\, d_i^2}{(1 - p_{ii})(1 - d_i^2)} + \frac{p_{ii}}{(1 - p_{ii})^2}\right]. \qquad (9) $$

Thus the measure (9) has an expression similar to Hadi's measure (7), indicating a relation between the local and deletion diagnostics.
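As a check on the algebra, the proposed slope can be coded next to Hadi's measure: under the reconstruction of (9) given above, the bracketed term differs from $H_i^2$ only in the power of $(1 - p_{ii})$ on the leverage component, so the two measures flag extreme cases similarly. The sketch below follows our reading of (9) (an assumption, since the printed equation is partly garbled in the source) and uses names of our own choosing.

```python
import numpy as np

def local_deletion_slope(X, y):
    """Slope (9), as reconstructed here:
    l_{i(i)} = 1 - e'e * [ k d_i^2 / ((1-p_ii)(1-d_i^2)) + p_ii / (1-p_ii)^2 ].

    Because of the leading '1 - ', strongly influential cases appear as the
    smallest (most negative) values of this measure."""
    n, k = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
    p = np.diag(P)                             # leverages p_ii
    e = y - P @ y                              # ordinary residuals
    d2 = e**2 / (e @ e)                        # squared normalized residuals
    bracket = k * d2 / ((1 - p) * (1 - d2)) + p / (1 - p)**2
    return 1.0 - (e @ e) * bracket
```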

5. Examples

5.1 Scottish hill races data. As a first numerical illustration, consider the Scottish hill races data reported in Atkinson (1986) and Lawrance (1989). The data give the record times (in seconds) of 35 Scottish hill races in 1984, along with two explanatory variables, the distance of the race (in miles) and the climb (in feet). The following model is fit to the data:

$$ \text{Time} = \beta_0 + \beta_1\,\text{Distance} + \beta_2\,\text{Climb} + \varepsilon. \qquad (10) $$

The data contain two clear outliers, observation 18 with $r_i = 4.6$ and observation 7 with $r_i = 2.8$. For comparison purposes, we remove these two observations and refit model (10) to the remaining 33 races. As seen in Table 1, the two largest cases based on $H_i^2$ coincide with those based on $l_{i(i)}$, whereas Cook's measure $C_i^2$ and the Billor and Loynes measure $l_i$ highlight different observations.

Table 1. Scottish hill races data: two largest values based on $H_i^2$, $l_{i(i)}$, $C_i^2$ and $l_i$.

Case  $H_i^2$    Case  $l_{i(i)}$    Case  $C_i^2$    Case  $l_i$

5.2 House price data. Brant (1986) considers the house price data on the selling price of houses (Weisberg, 1985) to illustrate the masking effect; these data have $n = 27$ and $k = 10$. We apply $H_i^2$, $l_{i(i)}$, $C_i^2$ and $l_i$ to these data and report the four largest values in Table 2. $H_i^2$, $l_{i(i)}$ and $C_i^2$ give the same results, while $l_i$ shows a very different pattern.

Table 2. House price data: four largest values based on $H_i^2$, $l_{i(i)}$, $C_i^2$ and $l_i$.

Case  $H_i^2$    Case  $l_{i(i)}$    Case  $C_i^2$    Case  $l_i$
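For readers who want to reproduce a comparison of this kind, the following is a workflow sketch for the hill races example. The file name and column layout are hypothetical (the original data appear in Atkinson, 1986), the sketch fits all cases rather than refitting after removing the two outliers as the paper does, and it reuses hadi_influence and local_deletion_slope from the earlier sketches.

```python
import numpy as np
# Reuses hadi_influence() and local_deletion_slope() from the earlier sketches.

# Hypothetical CSV with columns: time (seconds), distance (miles), climb (feet).
races = np.loadtxt("scottish_hill_races.csv", delimiter=",", skiprows=1)
time, dist, climb = races[:, 0], races[:, 1], races[:, 2]

# Design matrix for model (10): Time = b0 + b1*Distance + b2*Climb + error.
X = np.column_stack([np.ones_like(dist), dist, climb])

H2 = hadi_influence(X, time)
l_del = local_deletion_slope(X, time)

# Cases flagged most strongly by each measure (1-based case labels, as in Table 1).
print("two largest H_i^2     :", np.argsort(H2)[-2:][::-1] + 1)
print("two smallest l_{i(i)} :", np.argsort(l_del)[:2] + 1)
```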

Acknowledgments. The authors thank Ali S. Hadi and Robert Loynes for their helpful comments on an earlier version of this manuscript.

References

Atkinson, A. C. (1986). Comment: Aspects of diagnostic regression analysis. A comment on "Influential observations, high leverage points, and outliers in linear regression" by S. Chatterjee and A. S. Hadi. Statistical Science, 1(3).

Billor, N. and Loynes, R. M. (1993). Local influence: a new approach. Comm. Statist. Theory Meth., 22.

Brant, R. (1986). Finding and understanding influential sets in regression. Technical Report #466, School of Statistics, University of Minnesota.

Chatterjee, S. and Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3).

Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19.

Cook, R. D. (1986). Assessment of local influence (with discussion). Jour. Royal Statist. Soc., Ser. B, 48.

Hadi, A. S. (1992). A new measure of overall potential influence in linear regression. Computational Statistics & Data Analysis, 14.

Lawrance, A. J. (1988). Regression transformation diagnostics using local influence. Jour. Amer. Statist. Assoc., 83.

Lawrance, A. J. (1989). Local and deletion influence. Unpublished manuscript.

Peña, D. and Yohai, V. J. (1995). The detection of influential subsets in linear regression by using an influence matrix. Jour. Royal Statist. Soc., Ser. B, 57.

Weisberg, S. (1985). Applied Linear Regression, 2nd ed. Wiley, New York.

M. Mercedes Suárez Rancel and Miguel A. González Sierra
Department of Statistics, Operations Research and Computation
Faculty of Mathematics
University of La Laguna
Tenerife, Spain
msuarez@ull.es / magsierr@ull.es
