Diagnostics analysis for skew-normal linear regression models: applications to a quality of life dataset


Submitted to the Brazilian Journal of Probability and Statistics

Clécio da Silva Ferreira (a), Filidor Vilca (b) and Heleno Bolfarine (c)
(a) Universidade Federal de Juiz de Fora
(b) Universidade Estadual de Campinas
(c) Universidade de São Paulo

Abstract. The skew-normal distribution has been used successfully in various statistical applications. The main purpose of this paper is to consider local influence analysis, which is recognized as an important step of data analysis. Motivated by the simple form of the conditional expectation of the complete-data log-likelihood function used in the EM algorithm, diagnostic measures are derived from the case-deletion approach and the local influence approach inspired by Zhu et al. (2001) and Zhu and Lee (2001). Finally, the results obtained are applied to a dataset from a study conducted to evaluate quality of life (QOL) and to identify its associated factors in climacteric women with a history of breast cancer.

Keywords and phrases. Case-deletion; Local influence; Skew-normal distribution; EM algorithm.

1 Introduction

Linear symmetrical models have been investigated by various authors. For example, Lange et al. (1989) presented an approach to model Student-t distributions, while Galea-Rojas et al. (1997) and Liu (2000) discussed diagnostic methods for symmetrical linear regression models. Although the class of symmetric distributions is a better alternative than the normal distribution, it is not appropriate in situations where the sample distribution is asymmetrical. For example, Hill and Dixon (1982) discussed and presented examples with asymmetrical structures. From a practical viewpoint, many authors have transformed variables to achieve normality, and in many cases their methods are satisfactory. However, Azzalini and Capitanio (1999) pointed out some problems, mentioning, for example, that the transformed variables are more difficult to deal with and to interpret. A family of skew-normal distributions that accommodates practical values of skewness and kurtosis, and includes the normal distribution as a special case, was introduced by Azzalini (1985). Since then, many authors have considered these distributions in different areas, such as economics, oceanography, engineering and biomedical sciences, among others. Azzalini (2005) presented a discussion of skew-normal distributions with applications in regression models. Also,

Lachos et al. (2007) considered an application of diagnostics analysis in linear mixed models, and in the context of errors-in-variables models some results can be found in Lachos et al. (2008). Sahu et al. (2003) proposed a new skew-normal distribution, equivalent to that of Azzalini (1985), for which the EM algorithm used to obtain the maximum likelihood estimates has a simpler M-step. Consequently, some of its properties and applications can be easily derived, such as diagnostics analysis based on the conditional expectation of the complete-data log-likelihood; see Zhu and Lee (2001) and Zhu et al. (2001). Some applications of the skew-normal distribution of Sahu et al. (2003) can be found in Azevedo et al. (2011), Lee and McLachlan (2013) and Dagne (2016), among others.

After the model is fitted, diagnostics analysis is a key step in data analysis subsequent to parameter estimation. This procedure can be carried out by conducting an influence analysis to detect influential observations. There are two usual approaches to detect influential observations. The first is the case-deletion approach, which has received a great deal of attention since the paper by Cook (1977), which presented an intuitively appealing method. Since then, the Cook distance and the likelihood distance have been applied to many statistical models. This approach allows assessment of the individual impact of cases on the estimation process (see Cook and Weisberg, 1982). The second approach, which is a general statistical technique used to assess the stability of the estimation outputs with respect to the model inputs, is the local influence approach, due to Cook (1986). This method has received considerable attention recently in the statistical literature on regression models. Both approaches can be studied on the basis of the Q-function, considered by Zhu et al. (2001) to introduce case-deletion measures and by Zhu and Lee (2001) to study local influence diagnostics, as can be seen in Ferreira et al. (2015), Massuia et al. (2015) and Zeller et al. (2014). Although several diagnostic studies on regression models have appeared in the literature, no application of local influence has been considered for regression models under the skew-normal distribution of Sahu et al. (2003) with emphasis on the case-deletion approach and local influence. Based on the Q-function, which is closely related to the conditional expectation of the complete-data log-likelihood in the E-step of the EM algorithm, we develop a study of local influence following the approach of Zhu and Lee (2001), which produces results very similar to those obtained from Cook's method, but with simpler application. We also develop methods to obtain case-deletion measures by using the method of Zhu et al. (2001).

Following Sahu et al. (2003), we say that a random variable Y has a univariate skew-normal distribution with location parameter µ, scale parameter σ² and skewness parameter δ, if its

probability density function (pdf) is given by

f(y) = (2 / √(σ² + δ²)) φ((y − µ) / √(σ² + δ²)) Φ(δ (y − µ) / (σ √(σ² + δ²))),   (1.1)

where φ(·) and Φ(·) are the pdf and cumulative distribution function (cdf) of the N(0,1) distribution, respectively. This distribution is denoted by Y ∼ SN(µ, σ², δ). It is easy to see that for δ = 0 the pdf in (1.1) reduces to that of the normal distribution. The mean and variance of Y are

E(Y) = µ + √(2/π) δ   and   Var(Y) = σ² + (1 − 2/π) δ²,   (1.2)

respectively. The stochastic representation is given by Y =d µ + δ|X₀| + σX₁, where X₀ and X₁ are independent with distribution N(0,1). The notation =d means that both variables have the same distribution.

The paper is organized as follows. Section 2 introduces the skew-normal linear regression model and describes an EM algorithm for obtaining the maximum likelihood (ML) estimates. Section 3 provides a brief sketch of the case-deletion and local influence approaches for models with incomplete data, and also develops the methods pertinent to linear regression models under the skew-normal distribution. Moreover, the generalized leverage is discussed. Section 4 illustrates the method with the analysis of two examples, involving quality of life of women with breast cancer and simulation studies. Finally, some concluding remarks are made in Section 5.

2 The skew-normal linear regression model

In this section we present the linear regression model under the skew-normal distribution (SN-LRM). Specifically, we consider the linear regression model under the skew-normal distribution proposed by Sahu et al. (2003), as follows:

Y_j = x_j^T β + δ U_j + ε_j,   (2.1)

where x_j = (1, x_{j1}, ..., x_{jp})^T, β = (β₀, β₁, ..., β_p)^T, ε_j ∼ N(0, σ²) and U_j ∼ HN(0,1), j = 1, ..., n, all independent, where HN(0,1) denotes the standard half-normal distribution. Thus, Y_j ∼ SN(x_j^T β, σ², δ), with pdf given by

f(y_j; θ) = (2 / √(σ² + δ²)) φ((y_j − x_j^T β) / √(σ² + δ²)) Φ(δ (y_j − x_j^T β) / (σ √(σ² + δ²))),   (2.2)

where φ(·) and Φ(·) are as in (1.1). The mean and variance of Y_j are given by

E(Y_j) = x_j^T β + √(2/π) δ   and   Var(Y_j) = σ² + (1 − 2/π) δ²,   (2.3)

respectively.
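For illustration, the density (2.2) and the stochastic representation above can be evaluated and simulated as in the following minimal sketch (Python with NumPy and SciPy); the function names sn_pdf and sn_rvs and the parameter values are illustrative only and are not part of the original formulation.

```python
# Illustrative sketch: density and simulation for Y ~ SN(mu, sigma2, delta) in the
# parameterization of Sahu et al. (2003); not code from the paper.
import numpy as np
from scipy.stats import norm

def sn_pdf(y, mu, sigma2, delta):
    """Density (1.1)/(2.2): 2/s * phi((y-mu)/s) * Phi(delta*(y-mu)/(sigma*s)), s = sqrt(sigma2+delta^2)."""
    s = np.sqrt(sigma2 + delta**2)
    return 2.0 / s * norm.pdf((y - mu) / s) * norm.cdf(delta * (y - mu) / (np.sqrt(sigma2) * s))

def sn_rvs(n, mu, sigma2, delta, seed=None):
    """Draws via the stochastic representation Y = mu + delta*|X0| + sigma*X1, X0, X1 ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    return mu + delta * np.abs(rng.standard_normal(n)) + np.sqrt(sigma2) * rng.standard_normal(n)

# Check of the moments in (1.2): E(Y) = mu + sqrt(2/pi)*delta, Var(Y) = sigma2 + (1 - 2/pi)*delta^2.
mu, sigma2, delta = 1.0, 2.0, 1.5          # arbitrary illustrative values
y = sn_rvs(100_000, mu, sigma2, delta, seed=1)
print(y.mean(), mu + np.sqrt(2 / np.pi) * delta)
print(y.var(), sigma2 + (1 - 2 / np.pi) * delta**2)
```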

The log-likelihood function for θ = (β^T, σ², δ)^T, given y_1, ..., y_n, is

l(θ) = n log 2 − (n/2) log(2π) − (n/2) log(σ² + δ²) − (1 / (2(σ² + δ²))) Σ_{j=1}^n (y_j − x_j^T β)² + Σ_{j=1}^n log Φ(B_j),   (2.4)

where B_j = (δ / (σ √(σ² + δ²))) (y_j − x_j^T β). After some algebraic manipulation, the information matrix, denoted by I_F(θ) = [I_{γψ}], γ, ψ = β, σ², δ, can be obtained; its components are expressed in terms of the quantities a_{hk}(x) = E[Z^h W_Φ^k(xZ)], where W_Φ(u) = φ(u)/Φ(u), which are computed through the approximation given in Rodríguez and Branco (2007). Throughout, X = (x_1, ..., x_n)^T denotes the model matrix.

Next, we present the Q-function as an alternative to the observed-data log-likelihood function, which is associated with the conditional expectation of the complete-data log-likelihood function in the EM algorithm. The use of the Q-function will be useful to detect influential observations in regression models when the errors follow a skew-normal distribution, because it is significantly simpler. This idea is inspired by the works of Zhu and Lee (2001) and Zhu et al. (2001).

2.1 The Q-function and the EM algorithm

We implement the EM algorithm introduced by Dempster et al. (1977), in which the M-step involves only the complete-data Q-function, which in this case is computationally simple. The model (2.1) can be expressed by the hierarchical representation

Y_j | U_j = u_j ∼ N(x_j^T β + δ u_j, σ²)   and   U_j ∼ HN(0,1).   (2.5)

Hence U_j | Y_j = y_j ∼ HN(δ (y_j − x_j^T β)/(σ² + δ²), σ²/(σ² + δ²)). Now, letting y = (y_1, ..., y_n)^T and u = (u_1, ..., u_n)^T and treating u as missing data, the complete-data log-likelihood function

associated with y_c = (y^T, u^T)^T is given by l_c(θ | y_c) = Σ_{j=1}^n l_c(θ | y_j, u_j), where (without the additive constant)

l_c(θ | y_j, u_j) = −(1/2) log σ² − (1/(2σ²)) (y_j − x_j^T β)² + (δ/σ²) u_j (y_j − x_j^T β) − (δ²/(2σ²)) u_j².

Given the current estimate θ̂^(k) = (β̂^(k)T, σ̂²^(k), δ̂^(k))^T of θ at the kth iteration, the E-step calculates the conditional expectation of the complete-data log-likelihood function, or simply the Q-function,

Q(θ | θ̂^(k)) = Σ_{j=1}^n E[l_c(θ | y_j, U_j) | y_j, θ = θ̂^(k)] = Σ_{j=1}^n Q_j(θ | θ̂^(k)),   (2.6)

where

Q_j(θ | θ̂^(k)) = C − (1/2) log σ² − (1/(2σ²)) ǫ_j² + (δ/σ²) û_j^(k) ǫ_j − (δ²/(2σ²)) û²_j^(k),

with ǫ_j = y_j − x_j^T β and ǫ̂_j^(k) = y_j − x_j^T β̂^(k). The quantities û_j^(k) = E(U_j | y_j, θ = θ̂^(k)) and û²_j^(k) = E(U_j² | y_j, θ = θ̂^(k)) are obtained from properties of the half-normal distribution:

û_j^(k) = δ̂^(k) ǫ̂_j^(k) / (σ̂²^(k) + δ̂²^(k)) + (σ̂^(k) / (σ̂²^(k) + δ̂²^(k))^{1/2}) W_Φ(B̂_j^(k)),   (2.7)

û²_j^(k) = (δ̂^(k) ǫ̂_j^(k))² / (σ̂²^(k) + δ̂²^(k))² + σ̂²^(k) / (σ̂²^(k) + δ̂²^(k)) + (δ̂^(k) σ̂^(k) / (σ̂²^(k) + δ̂²^(k))^{3/2}) ǫ̂_j^(k) W_Φ(B̂_j^(k)),   (2.8)

where B̂_j^(k) is B_j in (2.4) evaluated at θ̂^(k), j = 1, ..., n, and W_Φ(u) = φ(u)/Φ(u). Thus, we have the following EM algorithm:

E-step: Given θ = θ̂^(k), compute û_j^(k) and û²_j^(k) for j = 1, ..., n, using (2.7) and (2.8).

M-step: Update θ̂^(k+1) by maximizing Q(θ | θ̂^(k)) over θ, which leads to the following closed-form expressions:

β̂^(k+1) = (X^T X)^{−1} X^T (y − δ̂^(k) û^(k)),

σ̂²^(k+1) = (1/n) [ ǫ̂^(k)T ǫ̂^(k) − 2 δ̂^(k) û^(k)T ǫ̂^(k) + (δ̂^(k))² Σ_{j=1}^n û²_j^(k) ],

δ̂^(k+1) = û^(k)T ǫ̂^(k) / Σ_{j=1}^n û²_j^(k),

where û^(k) = (û_1^(k), ..., û_n^(k))^T, ǫ̂^(k) = (ǫ̂_1^(k), ..., ǫ̂_n^(k))^T and X is the model matrix.
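A compact numerical sketch of this EM scheme is given below (Python with NumPy and SciPy). The routine names e_step and em_sn_lrm, the starting values and the simulated example data are illustrative choices and are not part of the original formulation.

```python
# Illustrative sketch of the EM algorithm of Section 2.1 for the SN-LRM.
import numpy as np
from scipy.stats import norm

def e_step(X, y, beta, sigma2, delta):
    """E-step quantities (2.7)-(2.8): u_hat = E(U_j | y_j), u2_hat = E(U_j^2 | y_j)."""
    eps = y - X @ beta
    s2 = sigma2 + delta**2
    B = delta * eps / np.sqrt(sigma2 * s2)                  # B_j of (2.4) at the current estimate
    W = norm.pdf(B) / norm.cdf(B)                           # W_Phi(u) = phi(u)/Phi(u)
    u_hat = delta * eps / s2 + np.sqrt(sigma2 / s2) * W
    u2_hat = (delta * eps / s2) ** 2 + sigma2 / s2 + delta * np.sqrt(sigma2) * eps / s2**1.5 * W
    return eps, u_hat, u2_hat

def em_sn_lrm(X, y, n_iter=2000, tol=1e-8):
    """EM for Y_j = x_j' beta + delta U_j + e_j with U_j ~ HN(0,1) and e_j ~ N(0, sigma2)."""
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]             # least-squares starting value
    sigma2, delta = np.var(y - X @ beta), 0.1
    XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)
    for _ in range(n_iter):
        eps, u_hat, u2_hat = e_step(X, y, beta, sigma2, delta)
        beta_new = XtX_inv_Xt @ (y - delta * u_hat)         # closed-form updates of the M-step
        sigma2_new = (eps @ eps - 2 * delta * u_hat @ eps + delta**2 * u2_hat.sum()) / n
        delta_new = (u_hat @ eps) / u2_hat.sum()
        done = np.max(np.abs(np.r_[beta_new - beta, sigma2_new - sigma2, delta_new - delta])) < tol
        beta, sigma2, delta = beta_new, sigma2_new, delta_new
        if done:
            break
    return beta, sigma2, delta

# Small illustrative example with simulated data (true values chosen arbitrarily).
rng = np.random.default_rng(0)
n, beta_true, sigma2_true, delta_true = 500, np.array([5.0, 2.0]), 1.0, 2.0
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ beta_true + delta_true * np.abs(rng.standard_normal(n)) + np.sqrt(sigma2_true) * rng.standard_normal(n)
print(em_sn_lrm(X, y))
```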

3 Diagnostics analysis

Influence diagnostic techniques are used to identify anomalous observations that impact model fit or statistical inference for the assumed statistical model. In the literature, there are primarily two approaches to detect influential observations. The case-deletion approach of Cook (1977) is the most popular one for identifying influential observations. For assessing the impact of influential observations on parameter estimates, some metrics have been used to measure the distance between θ̂_[j] and θ̂, such as the likelihood distance and Cook's distance. The second approach is a general statistical technique, proposed by Cook (1986), used to assess the stability of the estimation outputs with respect to the model inputs. We study here the case-deletion measures and the local influence diagnostics for linear regression models on the basis of the Q-function, inspired by the results of Zhu et al. (2001), Zhu and Lee (2001) and Lee and Xu (2004). We first consider the case-deletion measures, and then the local influence measures based on some perturbation schemes and the generalized leverage.

3.1 Case-deletion measures

Case-deletion is a classic approach to study the effects of dropping the jth case from the dataset. Let y_c = (y^T, u^T)^T be the augmented dataset, where a quantity with a subscript [j] denotes the original one with the jth observation deleted. The complete-data log-likelihood function based on the data with the jth case deleted is denoted by l_c(θ | y_{c[j]}). Let θ̂_[j] = (β̂_[j]^T, σ̂²_[j], δ̂_[j])^T be the maximizer of the function Q_[j](θ | θ̂) = E[l_c(θ | Y_{c[j]}) | y_[j], θ = θ̂], where θ̂ = (β̂^T, σ̂², δ̂)^T is the ML estimate of θ, and the estimates θ̂_[j] are obtained by using the EM algorithm based on the remaining n − 1 observations, with θ̂ as the initial value. To assess the influence of the jth case on θ̂, we compare θ̂_[j] with θ̂. If θ̂_[j] is far from θ̂ in some sense, then the jth case is regarded as influential. As θ̂_[j] is needed for every case, the required computational effort can be quite heavy, especially when the sample size is large. Hence, to approximate the case-deletion estimate θ̂_[j], Zhu et al. (2001) proposed the following one-step approximation based on the Q-function:

θ̂¹_[j] = θ̂ + {−Q̈(θ̂ | θ̂)}^{−1} Q̇_[j](θ̂ | θ̂),   (3.1)

where Q̈(θ̂ | θ̂) = ∂²Q(θ | θ̂)/∂θ∂θ^T |_{θ=θ̂} and Q̇_[j](θ̂ | θ̂) = ∂Q_[j](θ | θ̂)/∂θ |_{θ=θ̂} are the Hessian matrix and the gradient vector evaluated at θ̂, respectively. The Hessian matrix is an essential element in the method developed by Zhu and Lee (2001) to obtain the measures for case-deletion diagnosis. To develop the case-deletion measures, we have to obtain the elements in (3.1), Q̇_[j](θ̂ | θ̂) and Q̈(θ̂ | θ̂), which can be obtained quite easily from (2.6). The vector Q̇_[j](θ̂ | θ̂) =

(Q̇_[j]β(θ̂ | θ̂)^T, Q̇_[j]σ²(θ̂ | θ̂), Q̇_[j]δ(θ̂ | θ̂))^T has its elements given by

Q̇_[j]β(θ̂ | θ̂) = (1/σ̂²) X_[j]^T (ǫ̂_[j] − δ̂ û_[j]),

Q̇_[j]σ²(θ̂ | θ̂) = (1/(2σ̂²)) [ −(n − 1) + (1/σ̂²) ( ǫ̂_[j]^T ǫ̂_[j] − 2 δ̂ û_[j]^T ǫ̂_[j] + δ̂² Σ_{i=1, i≠j}^n û²_i ) ],

Q̇_[j]δ(θ̂ | θ̂) = (1/σ̂²) ( û_[j]^T ǫ̂_[j] − δ̂ Σ_{i=1, i≠j}^n û²_i ).

Moreover, the Hessian matrix Q̈(θ̂ | θ̂) has nonzero blocks

Q̈_ββ(θ̂) = −(1/σ̂²) X^T X,   Q̈_βδ(θ̂) = −(1/σ̂²) X^T û,   Q̈_σ²σ²(θ̂) = −n/(2σ̂⁴),   Q̈_δδ(θ̂) = −(n/σ̂²) ū²,   (3.2)

with the remaining blocks involving σ² equal to zero when evaluated at θ̂, where û = (û_1, ..., û_n)^T and ū² denotes the mean of the components of the vector û² = (û²_1, ..., û²_n)^T; its inverse follows from the standard formulas for partitioned matrices.

Inspired by the classic case-deletion measures, Cook's distance and the likelihood displacement, Zhu et al. (2001) and Lee and Xu (2004) presented analogous measures based on the Q-function. These measures are:

(i) Generalized Cook distance: this measure is defined similarly to the usual Cook distance, which is based on the genuine likelihood,

GD_j = (θ̂_[j] − θ̂)^T {−Q̈(θ̂ | θ̂)} (θ̂_[j] − θ̂).   (3.3)

Upon substituting (3.1) into (3.3), we obtain the approximate distance

GD¹_j = Q̇_[j](θ̂ | θ̂)^T {−Q̈(θ̂ | θ̂)}^{−1} Q̇_[j](θ̂ | θ̂);

(ii) Q-distance: this measure of the influence of the jth case is based on the Q-displacement function, similar to the likelihood distance LD_j discussed by Cook (1977), and is defined as

QD_j = 2 { Q(θ̂ | θ̂) − Q(θ̂_[j] | θ̂) }.   (3.4)

We can calculate an approximation of QD_j by substituting (3.1) into (3.4), resulting in the following approximation QD¹_j of QD_j:

QD¹_j = 2 { Q(θ̂ | θ̂) − Q(θ̂¹_[j] | θ̂) }.
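The one-step estimate (3.1) and the approximate generalized Cook distance can be computed as in the sketch below, which continues the EM sketch of Section 2.1 (X, y, e_step and em_sn_lrm are reused). The derivatives of the Q-function are obtained here by numerical central differences rather than by the closed-form expressions above; this is an implementation choice, not the paper's.

```python
# Illustrative sketch: one-step case deletion (3.1) and approximate generalized Cook
# distance, with derivatives of the Q-function obtained numerically (our choice).
# Continues the EM sketch of Section 2.1: X, y, e_step and em_sn_lrm are reused.
import numpy as np

def q_contrib(theta, X, y, u_hat, u2_hat):
    """Per-observation contributions Q_j(theta | theta_hat) of (2.6), up to a constant."""
    p = X.shape[1]
    beta, sigma2, delta = theta[:p], theta[p], theta[p + 1]
    eps = y - X @ beta
    return (-0.5 * np.log(sigma2) - eps**2 / (2 * sigma2)
            + delta * u_hat * eps / sigma2 - delta**2 * u2_hat / (2 * sigma2))

def num_grad(f, x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def num_hess(f, x, h=1e-4):
    return np.array([num_grad(lambda z: num_grad(f, z, h)[i], x, h) for i in range(x.size)])

beta, sigma2, delta = em_sn_lrm(X, y)                       # ML fit via the EM sketch
theta_hat = np.r_[beta, sigma2, delta]
_, u_hat, u2_hat = e_step(X, y, beta, sigma2, delta)        # E-step quantities at theta_hat

H = num_hess(lambda th: q_contrib(th, X, y, u_hat, u2_hat).sum(), theta_hat)
H_inv = np.linalg.inv(H)                                    # inverse Hessian of the Q-function

GD1 = np.empty(len(y))
for j in range(len(y)):
    keep = np.arange(len(y)) != j
    g_j = num_grad(lambda th: q_contrib(th, X[keep], y[keep], u_hat[keep], u2_hat[keep]).sum(),
                   theta_hat)                               # gradient of Q_[j] at theta_hat
    theta_onestep_j = theta_hat - H_inv @ g_j               # one-step estimate (3.1)
    GD1[j] = g_j @ (-H_inv) @ g_j                           # approximate generalized Cook distance
print("cases with largest GD:", np.argsort(GD1)[-3:])
```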

If the interest is in the influence of the jth case on some subset of the parameters, this can be assessed quite easily as follows. Let θ = (θ₁^T, θ₂^T)^T, where θ₁ = (β^T, σ²)^T is the parameter vector of the usual regression model and θ₂ = δ is the skewness parameter. From the expressions of Q̇_[j](θ̂ | θ̂) and Q̈(θ̂ | θ̂), the generalized Cook distance for the parameters (β, σ²) and δ can be defined as follows:

(1) Both β and σ² are the parameters of interest and δ is the nuisance parameter:

GD_j(β, σ²) = Q̇_[j]θ₁(θ̂ | θ̂)^T [I_{p+2}, 0] {−Q̈(θ̂ | θ̂)}^{−1} [I_{p+2}, 0]^T Q̇_[j]θ₁(θ̂ | θ̂),   (3.5)

where Q̇_[j]θ₁(θ̂ | θ̂) = (Q̇_[j]β(θ̂ | θ̂)^T, Q̇_[j]σ²(θ̂ | θ̂))^T, with θ₁ = (β^T, σ²)^T;

(2) δ is the parameter of interest and both β and σ² are nuisance parameters:

GD_j(δ) = Q̇_[j]δ(θ̂ | θ̂) b^T {−Q̈(θ̂ | θ̂)}^{−1} b Q̇_[j]δ(θ̂ | θ̂) = (Q̇_[j]δ(θ̂ | θ̂))² b^T {−Q̈(θ̂ | θ̂)}^{−1} b,   (3.6)

where Q̇_[j]δ(θ̂ | θ̂) is the third element of Q̇_[j](θ̂ | θ̂) and b is the (p + 3) × 1 vector with a one in the (p + 3)th position and zeros elsewhere.

3.2 The local influence approach

Consider a perturbation vector ω varying in an open region Ω ⊂ R^q. Let l_c(θ, ω | y_c), θ ∈ R^{p+3}, be the complete-data log-likelihood of the perturbed model. We assume there is an ω₀ such that l_c(θ, ω₀ | y_c) = l_c(θ | y_c) for all θ. Let θ̂_ω be the maximizer of the function Q(θ, ω | θ̂) = E[l_c(θ, ω | Y_c) | y, θ̂]. Then the influence graph is defined as α(ω) = (ω^T, f_Q(ω))^T, where f_Q(ω) is the Q-displacement function defined as

f_Q(ω) = 2 [ Q(θ̂ | θ̂) − Q(θ̂_ω | θ̂) ].

Following the approach developed in Zhu and Lee (2001), the normal curvature C_{f_Q,d} of α(ω) at ω₀ in the direction of a unit vector d can be used to summarize the local behavior of the Q-displacement function. It can be shown that

C_{f_Q,d}(θ) = −2 d^T Q̈_{ω₀} d = −2 d^T Δ_{ω₀}^T {Q̈(θ̂ | θ̂)}^{−1} Δ_{ω₀} d,

where Q̈(θ̂ | θ̂) = ∂²Q(θ | θ̂)/∂θ∂θ^T |_{θ=θ̂} and Δ_ω = ∂²Q(θ, ω | θ̂)/∂θ∂ω^T |_{θ=θ̂_ω}. As in Cook (1986), the matrix Q̈_{ω₀} = Δ_{ω₀}^T {Q̈(θ̂ | θ̂)}^{−1} Δ_{ω₀} is fundamental to detect influential observations. It may be of interest to assess the influence on a subset θ₁ of θ = (θ₁^T, θ₂^T)^T. For

example, one may be interested in θ₁ = β or θ₂ = δ. In such situations, the curvature in the direction d is given by

C_{f_Q,d}(θ₁) = −2 d^T Δ_{ω₀}^T ( {Q̈(θ̂ | θ̂)}^{−1} − B₂₂ ) Δ_{ω₀} d,   (3.7)

where B₂₂ is the block-diagonal matrix with blocks 0 and Q̈₂₂(θ̂ | θ̂)^{−1}, and Q̈₂₂(θ̂ | θ̂) is obtained from the partition of Q̈(θ̂ | θ̂) according to the partition of θ = (θ₁^T, θ₂^T)^T. A clear picture of −2Q̈_{ω₀} (a symmetric matrix) is given by its spectral decomposition

−2 Q̈_{ω₀} = Σ_{k=1}^n λ_k e_k e_k^T,

where (λ₁, e₁), ..., (λ_n, e_n) are the eigenvalue-eigenvector pairs of the matrix −2Q̈_{ω₀}, with λ₁ ≥ ... ≥ λ_q > λ_{q+1} = ... = λ_n = 0, and e₁, ..., e_n are elements of the associated orthonormal basis. Zhu and Lee (2001) proposed inspecting all eigenvectors corresponding to nonzero eigenvalues for more revealing information, but this can be computationally intensive for large n. Following Zhu and Lee (2001) and Lu and Song (2006), we consider an aggregated contribution vector of all eigenvectors corresponding to nonzero eigenvalues. Starting with some notation, let λ̃_k = λ_k / (λ₁ + ... + λ_q), e_k = (e_{k1}, ..., e_{kn})^T and

M(0)_l = Σ_{k=1}^q λ̃_k e²_{kl}.

Hence, the assessment of influential cases is based on {M(0)_l, l = 1, ..., n}, and one can obtain M(0)_l via B_{f_Q,u_l} = −2 u_l^T Q̈_{ω₀} u_l / tr[−2Q̈_{ω₀}], where u_l is a column vector in R^n with the lth entry equal to one and all other entries zero. Refer to Zhu and Lee (2001) for other theoretical properties of B_{f_Q,u_l}, such as invariance under reparameterization of θ. Additionally, Lee and Xu (2004) propose using 1/n + c·SM(0) as a benchmark to regard the lth case as influential, where c is an arbitrary constant (depending on the real application) and SM(0) is the standard deviation of {M(0)_l, l = 1, ..., n}. Finally, following the idea of Verbeke and Molenberghs (2000), we consider another important direction given by d = d_{kq}, a q × 1 vector with a one in the kth position and all other entries zero. In that case, the normal curvature is called the total local influence of the kth observation, which is given by

C_k = −2 d_{kq}^T Δ_{ω₀}^T {Q̈(θ̂ | θ̂)}^{−1} Δ_{ω₀} d_{kq},   k = 1, ..., q.   (3.8)

Moreover, the kth case is regarded as an influential observation if C_k is larger than the cutoff value c̄ = (1/q) Σ_{k=1}^q C_k. A numerical sketch of the computation of M(0) and its benchmark is given below.
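The sketch continues the previous ones (q_contrib, num_grad, num_hess, theta_hat, u_hat and u2_hat are reused). As a stand-in perturbation it weights the individual contributions Q_j, one of the schemes detailed in the next subsection, and the constant c = 3 mirrors the choice used in Section 4; the implementation is ours and is purely illustrative.

```python
# Illustrative sketch: aggregated contribution vector M(0) and benchmark of Section 3.2.
# The perturbed Q-function weights the Q_j contributions (case weights); it is one of the
# schemes of the next subsection.  Objects from the earlier sketches are reused.
import numpy as np

n = len(y)
omega0 = np.ones(n)                                        # null perturbation omega_0

def Q_pert(theta, omega):
    """Q(theta, omega | theta_hat) = sum_k omega_k Q_k(theta | theta_hat)."""
    return omega @ q_contrib(theta, X, y, u_hat, u2_hat)

# Delta_{omega_0}: since Q_pert is linear in omega, its kth column is the gradient of Q_k.
Delta = np.column_stack([num_grad(lambda th, k=k: q_contrib(th, X, y, u_hat, u2_hat)[k], theta_hat)
                         for k in range(n)])
H = num_hess(lambda th: Q_pert(th, omega0), theta_hat)     # Hessian of the Q-function at theta_hat

neg2Q_omega = -2.0 * Delta.T @ np.linalg.inv(H) @ Delta    # -2 * Q.._{omega_0}  (n x n, PSD)
lam, E = np.linalg.eigh(neg2Q_omega)
lam = np.clip(lam, 0.0, None)                              # discard tiny negative round-off values
M0 = (E**2) @ (lam / lam.sum())                            # M(0)_l = sum_k lam_k~ * e_{kl}^2
benchmark = 1.0 / n + 3.0 * M0.std()                       # 1/n + c*SM(0), with c = 3 as in Section 4
print("influential cases:", np.where(M0 > benchmark)[0])
```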

Now, we are ready to evaluate the matrix Δ_ω under three different perturbation schemes: case-weight perturbation, response perturbation and explanatory perturbation. In each scheme, Δ_ω has the partitioned form Δ_ω = (Δ_β^T, Δ_σ²^T, Δ_δ^T)^T.

(i) Case-weight perturbation. Let ω = (ω_1, ..., ω_n)^T be an n-dimensional vector with ω₀ = (1, ..., 1)^T ∈ R^n. Then, consider the following arbitrary allocation of weights for the Q-function, which may capture departures in general directions: Q(θ, ω | θ̂) = Σ_{k=1}^n ω_k Q_k(θ | θ̂). In this case, the kth column of the matrix Δ_ω is

( (1/σ̂²)(ǫ̂_k − δ̂ û_k) x_k^T,   −1/(2σ̂²) + (ǫ̂_k² − 2 δ̂ û_k ǫ̂_k + δ̂² û²_k)/(2σ̂⁴),   (û_k ǫ̂_k − δ̂ û²_k)/σ̂² )^T,   (3.9)

so that, in particular, Δ_β = (1/σ̂²) X^T [D(ǫ̂) − δ̂ D(û)], where D(a) denotes the diagonal matrix whose diagonal is the vector a.

(ii) Response perturbation. A perturbation of the response variables Y = (Y_1, ..., Y_n)^T is introduced by replacing Y_k by Y_{kω} = Y_k + ω_k S_y, where S_y is the standard deviation of Y. The perturbed Q-function is given by Q(θ, ω | θ̂), with Y_{kω} in place of Y_k and ω = (ω_1, ..., ω_n)^T. Under this perturbation scheme the null vector is ω₀ = 0, and the kth column of the matrix Δ_ω is

S_y ( (1/σ̂²) x_k^T,   (ǫ̂_k − δ̂ û_k)/σ̂⁴,   û_k/σ̂² )^T.   (3.10)

(iii) Explanatory perturbation. We are interested in perturbing a specific explanatory variable. Under this condition we have the perturbed explanatory variable x_{tω} = x_t + S_t ω, t ∈ {1, ..., p}, where S_t is the standard deviation of the explanatory variable x_t and ω₀ = 0. The perturbed Q-function is given by Q(θ, ω | θ̂), with x_{tω} in place of x_t. The kth column of the (p + 3) × n matrix Δ_ω is

( (S_t/σ̂²) [ (ǫ̂_k − δ̂ û_k) e_p(t) − β̂_t x_k ]^T,   −(β̂_t S_t/σ̂⁴)(ǫ̂_k − δ̂ û_k),   −(β̂_t S_t/σ̂²) û_k )^T,   (3.11)

where β̂_t is the tth element of β̂ and e_p(t) is a vector of the same dimension as x_j with a one in the position corresponding to the tth explanatory variable and zeros elsewhere.

3.3 Generalized leverage

Another concept that has been useful in the development of diagnostics in linear regression is leverage. The main idea behind the concept of leverage is that of evaluating the influence of y_j on its own predicted value (see Wei et al., 1998). This influence can be represented by the derivative ∂ŷ_j/∂y_j. Under the usual linear regression model, ∂ŷ_j/∂y_j = h_jj, the jth principal diagonal element of the projection matrix H = X(X^T X)^{−1} X^T. Inspired by the generalized leverage proposed by Wei et al. (1998), we consider a generalized leverage matrix for models with incomplete data defined by

GL(θ̂) = D_θ [−Q̈(θ̂ | θ̂)]^{−1} Q̇_{θ,y}(θ̂),   (3.12)

where D_θ = ∂µ(θ)/∂θ^T, Q̇_{θ,y}(θ̂) = ∂²Q(θ | θ̂)/∂θ∂y^T |_{θ=θ̂} and Q̈(θ̂ | θ̂) is the Hessian matrix. In the linear regression model under the skew-normal distribution, the expectation of Y is given by E(Y) = µ(θ) = Xβ + √(2/π) δ 1_n, so Ŷ = µ(θ̂) is the predicted response vector. In this case, we have that D_θ = [X, 0_n, √(2/π) 1_n], Q̇_{θ,y}(θ̂) is as in (3.10) without the term S_y, and Q̈(θ̂ | θ̂) is given in (3.2). We use c p*/n, with p* = tr(GL(θ̂)), as a benchmark to regard the jth case as a leverage point, where c is a selected constant (depending on the real application). So, observations with values GL_jj(θ̂) > c p*/n are considered leverage points.

4 Numerical application

In this section, a simulation study and a real example are presented to illustrate the performance of the developed method. First, we carry out a numerical illustration with simulated data. Then, we analyze real data to illustrate the usefulness of the proposed method.

4.1 Simulation study

In this section, we examine the performance of the developed method based on simulated data. We consider datasets of sizes n = 100 and n = 500 from the linear regression model defined in (4.1). That is,

y_j = β₀ + β₁ x_j + ǫ_j,   (4.1)

where the true values of the parameters (with β₀ = 5.0) are fixed in advance and x_j is generated from the uniform distribution U(0,1), j = 1, ..., n.

For each sample, we choose an observation and disturb it in the response variable. So, we consider three linear regression models: Model A, the model without perturbation; Model B, with y_B(ind) = a y(ind); and Model C, with y_C(ind) = b y(ind), where a > 1 and 0 < b < 1. For n = 100 the perturbed case is ind = 58, and for n = 500 it is ind = 38. Tables 1 and 2 show the maximum likelihood (ML) estimates of the parameters for the three models. In Model B, the estimate of β₀ is slightly lower than that of Model A, while the estimate of δ is slightly higher. However, as ŷ_j = x_j^T β̂ + √(2/π) δ̂, the fitted line is very similar to that of Model A. In Model C, the differences in the estimates of the parameters β₀, σ² and δ are more pronounced, but the fitted line is also similar to that of Model A, as can be seen in Figure 1(b).

Table 1: Simulated and adjusted values for model (4.1), with n = 100, with standard errors in parentheses (Models A, B and C).

Table 2: Simulated and adjusted values for model (4.1), with n = 500, with standard errors in parentheses (Models A, B and C).

Figure 1: Simulated data with n = 100 (a) and n = 500 (b). Scatterplot with adjusted line for Models A (solid line), B (dashed line) and C (dash-dot line).
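The perturbed datasets of Models B and C can be generated along the following lines; the values of β₁, σ², δ, a and b in this sketch are illustrative placeholders and are not the specific values used in the study.

```python
# Illustrative sketch of the simulation design of Section 4.1 (values of b1, sigma2,
# delta, a and b are placeholders chosen for illustration, not the ones used in the paper).
import numpy as np

rng = np.random.default_rng(2024)
n, ind = 100, 57                                  # n = 100; index 57 is case #58 (0-based indexing)
b0, b1, sigma2, delta = 5.0, 2.0, 1.0, 2.0        # b0 = 5 as in the text; the rest are placeholders
x = rng.uniform(0, 1, n)
e = delta * np.abs(rng.standard_normal(n)) + np.sqrt(sigma2) * rng.standard_normal(n)
y_A = b0 + b1 * x + e                             # Model A: no perturbation

a, b = 1.5, 0.5                                   # any a > 1 and 0 < b < 1
y_B, y_C = y_A.copy(), y_A.copy()
y_B[ind] = a * y_A[ind]                           # Model B: inflate the response of one case
y_C[ind] = b * y_A[ind]                           # Model C: deflate the response of one case
```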

Figure 2: Simulated data with n = 100. Model B: case-weight perturbation (a), response perturbation (b) and explanatory perturbation (c). Model C: case-weight perturbation (d), response perturbation (e) and explanatory perturbation (f).

Figure 3: Simulated data with n = 500. Model B: case-weight perturbation (a), response perturbation (b) and explanatory perturbation (c). Model C: case-weight perturbation (d), response perturbation (e) and explanatory perturbation (f).

Local influence analysis under the three perturbation schemes and the generalized leverage measures for the simulated data are described next. For n = 100, Model A yielded observations #49 and #99 as influential under case-weight perturbation, and #43, #49 and #99 under the response and explanatory perturbations. Three observations were classified as leverage points; see Figure 2. Model

B indicated observations #58 and #99 as influential for the three perturbation schemes; see Figures 2(a)-(c). For the three perturbation schemes under Model C, only observation #58 is detected; see Figures 2(d)-(f). In the same way, we have similar results for n = 500, with observation #38 detected as influential, as presented in Figures 3(a)-(c) and 3(e)-(f). The results do not change for larger sample sizes, and are omitted.

4.2 Real dataset

A dataset on the quality of life of women with breast cancer, obtained by the Center for Integral Attention to Women's Health (State University of Campinas, Brazil), was fitted with the skew-normal regression model and the diagnostic analysis approach was applied. The quality of life index was evaluated by the Medical Outcome Study 36-item Short-Form Health Survey (SF-36) questionnaire. These indexes condense eight components into two: a physical component summary and a mental component summary. Conde et al. (2005) analyzed this dataset, evaluating the factors associated with the quality of life of women with breast cancer. The response variable is the physical component summary of the quality of life index (pcs), and the explanatory variables are the indicator variable dizziness and the body mass index (bmi) of the individual. In this study, we discuss these data based on the skew-normal linear regression model

pcs_j = β₀ + β₁ dizziness_j + β₂ bmi_j + e_j,   j = 1, ..., 97,   (4.2)

where the e_j are independent with e_j ∼ SN(0, σ², δ).

4.2.1 Goodness-of-fit and estimation

The ML estimates of the parameters and their corresponding standard errors (calculated using the observed information matrix) are reported in Table 3, which shows that the estimates of β in both models are similar. In contrast, the estimates of the parameter σ² are different. We have constructed the QQ-plots and envelopes in Figure 4 (lines represent the 5th percentile, the mean and the 95th percentile of the simulated points for each observation) based on r_j² = (y_j − x_j^T β̂)²/(σ̂² + δ̂²), which has an approximate χ²₁ distribution. The variable r_j² also enables us to check the model and detect the presence of outlying observations. The simulated envelope plotted to validate the skew-normal regression model indicates that no points lie outside the confidence bounds (Figure 4(b)). On the other hand, the simulated envelope for the normal linear regression model (Figure 4(a)) indicates some exterior points. Figure 5 displays the scatter plot of the data and the adjusted lines for each status of dizziness, showing a negative relationship between pcs and bmi.
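The fit of model (4.2) and the likelihood-ratio test of H₀: δ = 0 discussed below can be reproduced along the following lines. The file name and the column names (pcs, dizziness, bmi) are hypothetical placeholders, and em_sn_lrm is the EM routine sketched in Section 2.1.

```python
# Illustrative sketch: fitting model (4.2) and testing H0: delta = 0.  The file and
# column names are hypothetical placeholders; em_sn_lrm comes from the EM sketch.
import numpy as np
import pandas as pd
from scipy.stats import norm, chi2

def sn_loglik(X, y, beta, sigma2, delta):
    """Observed-data log-likelihood (2.4)."""
    eps = y - X @ beta
    s2 = sigma2 + delta**2
    B = delta * eps / np.sqrt(sigma2 * s2)
    return np.sum(np.log(2) - 0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2)
                  - eps**2 / (2 * s2) + norm.logcdf(B))

qol = pd.read_csv("qol_breast_cancer.csv")                  # hypothetical file with 97 rows
X = np.column_stack([np.ones(len(qol)), qol["dizziness"], qol["bmi"]])
y = qol["pcs"].to_numpy()

beta_sn, sigma2_sn, delta_sn = em_sn_lrm(X, y)              # skew-normal fit
l1 = sn_loglik(X, y, beta_sn, sigma2_sn, delta_sn)

beta_n = np.linalg.lstsq(X, y, rcond=None)[0]               # normal fit (delta = 0)
sigma2_n = np.mean((y - X @ beta_n) ** 2)
l0 = sn_loglik(X, y, beta_n, sigma2_n, 0.0)

LR = 2 * (l1 - l0)
print("LR =", LR, "p-value =", chi2.sf(LR, df=1))
```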

Figure 4: Simulated envelopes for the quality of life dataset. (a) Fit under the normal linear model and (b) fit under the skew-normal linear model.

Table 3: ML estimates for the quality of life dataset under the normal and skew-normal linear regression models (estimates, standard errors, log-likelihood and AIC).

Figure 5(b) shows the histogram of the residuals from model (4.2), indicating the presence of asymmetric data. Based on the normal approximation for the ML estimates, the 95% confidence interval for δ (CI: δ̂ ± 1.96 SE) is [6.56, 7.4], which does not include the value zero. That is, a skew-normal model is more suitable than a normal linear model. A similar result is obtained when testing the null hypothesis H₀: δ = 0 against the alternative H₁: δ ≠ 0 using the likelihood-ratio (LR) statistic, which has a χ²₁ distribution under H₀. In this case, LR = 2(l(θ̂) − l(θ̃)) = 3.88, with a p-value of approximately 0.049.

4.2.2 Influence diagnostics analysis

First, we identify outlying observations under the fitted model based on the case-deletion measures, the Q-distance and the generalized Cook distance. The results are reported in Figure 6. We note from these figures that observations #4 and #73 are potentially influential on the parameter estimates. These refer to women without dizziness with a low quality of life index. Observation #65 shows a slight influence and represents a woman with a high quality of life index. We also used the generalized Cook distance when the interest is in a subset

of the parameters. All results were similar, so they are not shown here to save space.

Figure 5: Quality of life dataset. (a) Scatter plot and adjusted lines (with and without dizziness), (b) histogram of the Pearson residuals.

Figure 6: Quality of life dataset. (a) Generalized Cook distance, (b) Q-distance.

Second, we conducted the local influence diagnostics analysis to detect outlying observations, using the perturbation schemes discussed previously and the generalized leverage. For the perturbation schemes we obtained the values of M(0), and the figures present the index plots of M(0). The horizontal lines delimit the Lee and Xu (2004) benchmark for M(0), with c = 3. Under the perturbation schemes, observations #4 and #73, together with one additional case, are potentially influential on the parameter estimates. These results are reported in Figure 7 and Figure 8(a). Note that observations #4 and #73 are potentially influential under all perturbation schemes; they correspond to the smallest values of pcs for women without dizziness. On the other hand, the additional case is influential only under case-weight perturbation and response perturbation; it is a woman with a small value of pcs and with dizziness. Moreover, we performed a diagnostics analysis based on the C_i measures defined in (3.8), and the influential observations were the same observations #4 and #73.

Figure 7: Quality of life dataset. Diagnostics for case-weight perturbation (a) and response perturbation (b) (benchmark with c = 3).

Figure 8: Quality of life dataset. (a) Diagnostics for the bmi variable perturbation (benchmark with c = 3) and (b) generalized leverage.

We then used the generalized leverage, and observation #5, together with two further cases, was indicated as influential, as can be seen in Figure 8(b). These observations are more distant from the main mass of the data (to the right); they are women without dizziness and with high bmi. It is important to note that these observations were not detected using the perturbation schemes, so the usefulness of the generalized leverage can be seen.
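For completeness, the generalized leverage (3.12) used above can be computed along the lines of the sketch below, again with numerical derivatives of the Q-function (an implementation choice of this sketch, not the paper's closed form); it reuses em_sn_lrm, e_step, q_contrib, num_grad and num_hess from the earlier sketches, together with the design matrix X and response y of the dataset at hand.

```python
# Illustrative sketch: generalized leverage GL(theta_hat) of (3.12), with the mixed
# derivative of the Q-function with respect to y obtained by finite differences.
# Reuses em_sn_lrm, e_step, q_contrib, num_grad, num_hess and the data X, y.
import numpy as np

beta, sigma2, delta = em_sn_lrm(X, y)                       # fit for the data at hand
theta_hat = np.r_[beta, sigma2, delta]
_, u_hat, u2_hat = e_step(X, y, beta, sigma2, delta)

n = len(y)
Q = lambda th, yy: q_contrib(th, X, yy, u_hat, u2_hat).sum()
H = num_hess(lambda th: Q(th, y), theta_hat)                # Hessian of the Q-function

# Mixed derivative d^2 Q / d theta d y_k, with the E-step quantities held fixed.
h = 1e-5
Q_theta_y = np.empty((len(theta_hat), n))
for k in range(n):
    e = np.zeros(n); e[k] = h
    Q_theta_y[:, k] = (num_grad(lambda th: Q(th, y + e), theta_hat)
                       - num_grad(lambda th: Q(th, y - e), theta_hat)) / (2 * h)

# D_theta = d mu(theta)/d theta' for mu(theta) = X beta + sqrt(2/pi) delta 1_n.
D_theta = np.column_stack([X, np.zeros(n), np.full(n, np.sqrt(2 / np.pi))])

GL = D_theta @ np.linalg.solve(-H, Q_theta_y)               # n x n generalized leverage matrix
gl = np.diag(GL)
c = 2.0                                                     # a selected constant, as in Section 3.3
print("leverage points:", np.where(gl > c * np.trace(GL) / n)[0])
```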

Figure 9: Quality of life dataset. Diagnostics of influence on β and δ (benchmark with c = 3): (a) case-weight perturbation, (b) response perturbation, (c) bmi variable perturbation and (d) C_i for case-weight perturbation.

Finally, we also discuss the influence of some observations on the estimates of β and δ based on C_{f_Q,d}(β) and C_{f_Q,d}(δ), as defined in (3.7). For the three perturbation schemes, Figures 9(a)-(c) show the scatter plots of C_{f_Q,d}(β) versus C_{f_Q,d}(δ). From these figures, we observe that #73 is potentially influential on both parameter estimates β and δ, regardless of the perturbation scheme. Moreover, Figure 9(d) shows that observation #3 has relatively large C_i(β) and C_i(δ); it corresponds to a woman without dizziness and high bmi.

5 Conclusion

In this paper, we have provided a diagnostics analysis for skew-normal regression models based on the approaches proposed by Zhu et al. (2001) and Zhu and Lee (2001). We have proposed a procedure for computing case-deletion measures and local influence diagnostics on the basis of the conditional expectation of the complete-data log-likelihood function used in the EM algorithm, and we have also considered a generalized leverage. Case-weight, response and explanatory perturbations are considered. Explicit expressions are obtained for the Hessian matrix Q̈(θ̂ | θ̂) and for the matrix Δ_ω under the different perturbation schemes. The generalized leverage

developed here is a necessary supplement to the results obtained from the perturbation schemes: this diagnostic measure helps to detect observations that are not detected by the perturbation schemes.

References

Azevedo, C. L. N., Bolfarine, H., Andrade, D. F., 2011. Bayesian inference for a skew-normal IRT model under the centred parameterization. Computational Statistics and Data Analysis 55.

Azzalini, A., 1985. A class of distributions which includes the normal ones. Scandinavian Journal of Statistics 12.

Azzalini, A., 2005. The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics 32.

Azzalini, A., Capitanio, A., 1999. Statistical applications of the multivariate skew-normal distribution. Journal of the Royal Statistical Society, Series B 61.

Conde, D. M., Pinto-Neto, A., Cabello, C., Santos-Sá, D., Costa-Paiva, C., Martinez, E., 2005. Quality of life in Brazilian breast cancer survivors age 45-65 years: associated factors. The Breast Journal 11.

Cook, R. D., 1977. Detection of influential observation in linear regression. Technometrics 19, 15-18.

Cook, R. D., 1986. Assessment of local influence. Journal of the Royal Statistical Society, Series B 48.

Cook, R. D., Weisberg, S., 1982. Residuals and Influence in Regression. Chapman & Hall/CRC, Boca Raton, FL.

Dagne, G. A., 2016. Bayesian segmental growth mixture Tobit models with skew distributions. Computational Statistics 31.

Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1), 1-38.

Ferreira, C. S., Lachos, V. H., Bolfarine, H., 2015. Inference and diagnostics in skew scale mixtures of normal regression models. Journal of Statistical Computation and Simulation 85 (3).

Galea-Rojas, M., Paula, G. A., Bolfarine, H., 1997. Local influence in elliptical linear regression models. The Statistician 46.

Hill, M. A., Dixon, W. J., 1982. Robustness in real life: a study of clinical laboratory data. Biometrics 38.

Lachos, V. H., Bolfarine, H., Arellano-Valle, R. B., Montenegro, L. C., 2007. Likelihood based inference for multivariate skew-normal regression models. Communications in Statistics - Theory and Methods 36.

Lachos, V. H., Montenegro, L. C., Bolfarine, H., 2008. Inference and influence diagnostics for skew-normal null intercept measurement errors models. Journal of Statistical Computation and Simulation 78.

Lange, K. L., Little, R., Taylor, J., 1989. Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84.

Lee, S. X., McLachlan, G. J., 2013. On mixtures of skew normal and skew t-distributions. Advances in Data Analysis and Classification 7.

Lee, S. Y., Xu, L., 2004. Influence analysis of nonlinear mixed-effects models. Computational Statistics and Data Analysis 45.

Liu, S. Z., 2000. On local influence for elliptical linear models. Statistical Papers 41.

Lu, B., Song, X.-Y., 2006. Local influence of multivariate probit latent variable models. Journal of Multivariate Analysis 97 (8).

Massuia, M. B., Cabral, C. R. B., Matos, L. A., Lachos, V. H., 2015. Influence diagnostics for Student-t censored linear regression models. Statistics: A Journal of Theoretical and Applied Statistics 49 (5).

Rodríguez, C. L. B., Branco, M. D., 2007. Bayesian inference for the skewness parameter of the scalar skew-normal distribution. Brazilian Journal of Probability and Statistics.

Sahu, S. K., Dey, D. K., Branco, M. D., 2003. A new class of multivariate distributions with applications to Bayesian regression models. The Canadian Journal of Statistics 31, 129-150.

Verbeke, G., Molenberghs, G., 2000. Linear Mixed Models for Longitudinal Data. Springer, New York.

Wei, B. C., Qu, Y. Q., Fung, W. K., 1998. Generalized leverage and its applications. Scandinavian Journal of Statistics 25.

Zeller, C. B., Lachos, V. H., Vilca, F. V., 2014. Influence diagnostics for Grubbs's model with asymmetric heavy-tailed distributions. Statistical Papers 55.

Zhu, H., Lee, S., 2001. Local influence for incomplete-data models. Journal of the Royal Statistical Society, Series B 63, 111-126.

Zhu, H., Lee, S., Wei, B., Zhou, J., 2001. Case-deletion measures for models with incomplete data. Biometrika 88 (3), 727-737.

Clécio da Silva Ferreira
clecio.ferreira@ufjf.edu.br

Heleno Bolfarine
hbolfar@ime.usp.br

Filidor Vilca
fily@ime.unicamp.br


Diagnostics for Linear Models With Functional Responses

Diagnostics for Linear Models With Functional Responses Diagnostics for Linear Models With Functional Responses Qing Shen Edmunds.com Inc. 2401 Colorado Ave., Suite 250 Santa Monica, CA 90404 (shenqing26@hotmail.com) Hongquan Xu Department of Statistics University

More information

Minimax design criterion for fractional factorial designs

Minimax design criterion for fractional factorial designs Ann Inst Stat Math 205 67:673 685 DOI 0.007/s0463-04-0470-0 Minimax design criterion for fractional factorial designs Yue Yin Julie Zhou Received: 2 November 203 / Revised: 5 March 204 / Published online:

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Practical Econometrics. for. Finance and Economics. (Econometrics 2)

Practical Econometrics. for. Finance and Economics. (Econometrics 2) Practical Econometrics for Finance and Economics (Econometrics 2) Seppo Pynnönen and Bernd Pape Department of Mathematics and Statistics, University of Vaasa 1. Introduction 1.1 Econometrics Econometrics

More information

Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level

Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level A Monte Carlo Simulation to Test the Tenability of the SuperMatrix Approach Kyle M Lang Quantitative Psychology

More information

Lecture 1: Linear Models and Applications

Lecture 1: Linear Models and Applications Lecture 1: Linear Models and Applications Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction to linear models Exploratory data analysis (EDA) Estimation

More information

Description Syntax for predict Menu for predict Options for predict Remarks and examples Methods and formulas References Also see

Description Syntax for predict Menu for predict Options for predict Remarks and examples Methods and formulas References Also see Title stata.com logistic postestimation Postestimation tools for logistic Description Syntax for predict Menu for predict Options for predict Remarks and examples Methods and formulas References Also see

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Mixture Models and EM

Mixture Models and EM Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

CHAPTER 5. Outlier Detection in Multivariate Data

CHAPTER 5. Outlier Detection in Multivariate Data CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for

More information

AFT Models and Empirical Likelihood

AFT Models and Empirical Likelihood AFT Models and Empirical Likelihood Mai Zhou Department of Statistics, University of Kentucky Collaborators: Gang Li (UCLA); A. Bathke; M. Kim (Kentucky) Accelerated Failure Time (AFT) models: Y = log(t

More information

LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

STAT 4385 Topic 06: Model Diagnostics

STAT 4385 Topic 06: Model Diagnostics STAT 4385 Topic 06: Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 1/ 40 Outline Several Types of Residuals Raw, Standardized, Studentized

More information

Estimation: Problems & Solutions

Estimation: Problems & Solutions Estimation: Problems & Solutions Edps/Psych/Soc 587 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline 1. Introduction: Estimation

More information

Regression Graphics. 1 Introduction. 2 The Central Subspace. R. D. Cook Department of Applied Statistics University of Minnesota St.

Regression Graphics. 1 Introduction. 2 The Central Subspace. R. D. Cook Department of Applied Statistics University of Minnesota St. Regression Graphics R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Abstract This article, which is based on an Interface tutorial, presents an overview of regression

More information

Biostat 2065 Analysis of Incomplete Data

Biostat 2065 Analysis of Incomplete Data Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies

More information

Box-Cox Transformations

Box-Cox Transformations Box-Cox Transformations Revised: 10/10/2017 Summary... 1 Data Input... 3 Analysis Summary... 3 Analysis Options... 5 Plot of Fitted Model... 6 MSE Comparison Plot... 8 MSE Comparison Table... 9 Skewness

More information

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 704: Data Analysis I, Fall 2014 1 / 13 Chapter 8: Polynomials & Interactions

More information

Empirical likelihood for linear models in the presence of nuisance parameters

Empirical likelihood for linear models in the presence of nuisance parameters Empirical likelihood for linear models in the presence of nuisance parameters Mi-Ok Kim, Mai Zhou To cite this version: Mi-Ok Kim, Mai Zhou. Empirical likelihood for linear models in the presence of nuisance

More information

Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA

Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA ABSTRACT Regression analysis is one of the most used statistical methodologies. It can be used to describe or predict causal

More information

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 UCLA 2 Abstract Multilevel analysis often leads to modeling

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

The Use of Copulas to Model Conditional Expectation for Multivariate Data

The Use of Copulas to Model Conditional Expectation for Multivariate Data The Use of Copulas to Model Conditional Expectation for Multivariate Data Kääri, Meelis; Selart, Anne; Kääri, Ene University of Tartu, Institute of Mathematical Statistics J. Liivi 2 50409 Tartu, Estonia

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information

A Cautionary Note on Generalized Linear Models for Covariance of Unbalanced Longitudinal Data

A Cautionary Note on Generalized Linear Models for Covariance of Unbalanced Longitudinal Data A Cautionary Note on Generalized Linear Models for Covariance of Unbalanced Longitudinal Data Jianhua Z. Huang a, Min Chen b, Mehdi Maadooliat a, Mohsen Pourahmadi a a Department of Statistics, Texas A&M

More information

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,

More information

SIMULTANEOUS CONFIDENCE BANDS FOR THE PTH PERCENTILE AND THE MEAN LIFETIME IN EXPONENTIAL AND WEIBULL REGRESSION MODELS. Ping Sa and S.J.

SIMULTANEOUS CONFIDENCE BANDS FOR THE PTH PERCENTILE AND THE MEAN LIFETIME IN EXPONENTIAL AND WEIBULL REGRESSION MODELS. Ping Sa and S.J. SIMULTANEOUS CONFIDENCE BANDS FOR THE PTH PERCENTILE AND THE MEAN LIFETIME IN EXPONENTIAL AND WEIBULL REGRESSION MODELS " # Ping Sa and S.J. Lee " Dept. of Mathematics and Statistics, U. of North Florida,

More information

Multivariate Regression (Chapter 10)

Multivariate Regression (Chapter 10) Multivariate Regression (Chapter 10) This week we ll cover multivariate regression and maybe a bit of canonical correlation. Today we ll mostly review univariate multivariate regression. With multivariate

More information

Chapter 14. Linear least squares

Chapter 14. Linear least squares Serik Sagitov, Chalmers and GU, March 5, 2018 Chapter 14 Linear least squares 1 Simple linear regression model A linear model for the random response Y = Y (x) to an independent variable X = x For a given

More information