Identifying and accounting for outliers and extreme response patterns in latent variable modelling



1 Identifying and accounting for outliers and extreme response patterns in latent variable modelling Irini Moustaki Athens University of Economics and Business

2 Outline
1. Define the problem of outliers and atypical response patterns.
2. Treat aberrant response patterns within a LVM framework:
   - Forward Search and Backward Search
   - Mixture model
   - Robust ML estimation
3. Applications

3 Response strategies
1. Main response strategy: Responses to educational and psychological tests are supposed to be based on latent constructs. This strategy is used by the majority of the respondents and can be modelled using an IRT model.
2. Alternative response strategy: Based on guessing and randomness. Secondary strategies are modelled as separate latent classes. Hybrid model for binary responses (Yamamoto, 1989) and for ordinal responses (Nelson, 1999).

4 Secondary response strategies - Outliers
Categorical responses:
1. The agreement response style: tendency to agree/disagree with all questions regardless of their content.
2. The extreme response style: tendency for some individuals to use the extreme ends of a Likert-type response scale (e.g. strongly agree or strongly disagree).
3. The neutral response style: tendency to choose the middle alternative in a Likert-type scale.
Continuous responses: a response more than 3 standard deviations away from the mean is considered an unexpected response under the normal model.

5 Aberrant response patterns are identified using person-fit statistics, residuals, or latent class analysis, as long as an expected response style is adopted as a benchmark (e.g. Guttman scale). These response styles should either be seen as outliers, and therefore controlled for or removed from the estimation process, or be treated as a manifestation of certain respondents' characteristics. In the literature, interest lies in how response styles differ among groups defined by demographic and socio-economic variables.

6 Searching for outliers and extreme response patterns
ML estimation assumes that the responses are generated exactly by the true model. Ways of dealing with departures from this assumption:
- A robust estimator that does not break down when outliers are present in the data set has been proposed for the latent variable model with mixed binary and continuous responses (Moustaki and Victoria-Feser, 2006, JASA).
- Implement algorithms such as the forward search (Hadi 1992, Atkinson 1994) for identifying outliers (Mavridis, PhD thesis).
- Use the latent variable model itself to accommodate outliers in the data: predict and account for the proportion of individuals that guessed a response pattern (Martin Knott).

7 Theoretical Framework for LVM
Bartholomew and Knott (1999)
A LVM models the associations among a set of p observed variables $(y_1, y_2, \ldots, y_p)$ using q latent variables $(z_1, z_2, \ldots, z_q)$, where q is much less than p.
As only y can be observed, any inference must be based on the joint distribution of y:
$$f(y) = \int_{R_z} g(y \mid z)\, \phi(z)\, dz$$
$\phi(z)$: prior distribution of z
$g(y \mid z)$: conditional distribution of y given z
What we want to know: $\phi(z \mid y)$

8 If correlations among the y's can be explained by a set of latent variables, then when all z's are accounted for the y's will be independent (local independence). q must be chosen so that:
$$g(y \mid z) = \prod_{i=1}^{p} g(y_i \mid z)$$
The question is whether f(y) admits the representation
$$f(y) = \int_{R_z} \prod_{i=1}^{p} g(y_i \mid z)\, h(z)\, dz$$
for some small value of q, where h(z) denotes the prior density of z.

9 Identifying aberrant responses in factor analysis for continuous responses
$$y_i = \mu_i + \sum_{j=1}^{q} \lambda_{ij} z_j + e_i, \quad i = 1, \ldots, p \qquad (1)$$
z is called the common factor since it is common to all $y_i$'s. The $e_i$'s were sometimes called specific since they are unique to a particular $y_i$.

10
$$\mathrm{Cov}(e, z) = 0, \qquad y = \mu + \Lambda z + e, \qquad e \sim N(0, \Psi)$$
We also assume that $e_1, e_2, \ldots, e_p$ are independent, so that $y_1, y_2, \ldots, y_p$ are conditionally independent given z. Then we can make some deductions about the distribution of the y's and, in particular, about their covariances and correlations.
We can choose the scale and origin of z as we please because this does not affect the form of the regression equation.
All parameter estimation techniques minimize some function of the distance between the sample covariance matrix S and the covariance matrix under the model, $\Sigma(\theta)$.
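The deduction about the covariances can be written out in one line. The sketch below additionally assumes that the latent variables are standardized and uncorrelated, $\mathrm{Var}(z) = I$; this is not stated on the slide, although the freedom to choose the scale and origin of z makes it a common convention.

```latex
\Sigma(\theta) = \mathrm{Cov}(y)
  = \Lambda\,\mathrm{Var}(z)\,\Lambda' + \mathrm{Cov}(e)
  = \Lambda\Lambda' + \Psi,
\qquad \theta = (\Lambda, \Psi)
```

The cross terms vanish because $\mathrm{Cov}(e, z) = 0$, and $\Psi$ is diagonal because the $e_i$ are independent.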

11 Methods for identifying outliers
Model-free methods: graphs, distance measures:
$$D(m) = (y_m - T(y))'\, C_y^{-1}\, (y_m - T(y)), \quad m = 1, \ldots, n$$
T(y) is known as the location parameter while $C_y$ is known as the dispersion parameter.
Mahalanobis distance: substitute T(y) with the arithmetic mean and $C_y$ with the sample variance-covariance matrix:
$$MD(m) = (y_m - \bar{y})'\, S_y^{-1}\, (y_m - \bar{y}), \quad m = 1, \ldots, n$$
$\bar{y}$ and $S_y$ are sensitive to the presence of outliers.
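As an illustration of the distance measure MD(m), here is a minimal NumPy sketch. The function name `mahalanobis_distances` and the simulated data are assumptions made for this example only; the sample covariance uses the usual n - 1 divisor.

```python
import numpy as np

def mahalanobis_distances(y):
    """Squared Mahalanobis distance MD(m) of each row of y from the sample mean."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean(axis=0)
    s = np.cov(y, rowvar=False)  # sample covariance matrix S_y (divisor n - 1)
    # Solve S_y x = (y_m - ybar) instead of forming the inverse explicitly.
    solved = np.linalg.solve(s, centered.T).T
    return np.einsum("ij,ij->i", centered, solved)

# Hypothetical data: 50 standard normal observations with one planted outlier.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
data[0] = [8.0, -8.0, 8.0]
md = mahalanobis_distances(data)
```

Because the mean and covariance are computed from all observations, the planted point also inflates $S_y$, which is exactly the sensitivity noted above.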

12 The sample mean and covariance matrix are not robust estimators of multivariate location and dispersion: their breakdown point is 0%. Yuan and Bentler (2001) and Moustaki and Victoria-Feser (2006) show, theoretically and via a simulation study, that the sample covariance matrix has an unbounded influence function and zero breakdown point.
Robust estimators for the location and dispersion parameters have been proposed (Gnanadesikan and Kettenring, 1972; Campbell, 1980; Rousseeuw, 1985; Rousseeuw and van Zomeren, 1990).
Robust estimates for the FA model: use a robust sample covariance matrix (Filzmoser, 1999; Pison et al., 2003).
Examination of residuals.

13 Detection errors
1. Outliers may be responsible for improper solutions.
2. Swamping effect: an observation is considered an outlier when it is not.
3. Masking effect: an outlier is not detected (cluster of outliers).

14 Backward Search
Hadi (1992), Hadi and Simonoff (1993), Atkinson (1994), Atkinson and Riani (2000), Atkinson, Riani and Cerioli (2004)
1. Start with the whole data set.
2. Compute a measure that is intended to measure outlyingness.
3. The observation which seems to be the most extreme, according to the criterion used, is deleted.
4. This process is iterated until a certain number of observations is excluded.
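The four steps can be sketched as follows. This is a simplified illustration, not the procedure used in the talk: outlyingness is measured here by the Mahalanobis distance rather than by a factor-model criterion, and the name `backward_search` and the simulated data are assumptions.

```python
import numpy as np

def backward_search(y, n_remove):
    """Delete, one at a time, the unit with the largest Mahalanobis distance."""
    y = np.asarray(y, dtype=float)
    keep = list(range(y.shape[0]))  # step 1: start with the whole data set
    removed = []
    for _ in range(n_remove):       # step 4: iterate a fixed number of times
        sub = y[keep]
        diff = sub - sub.mean(axis=0)
        cov = np.cov(sub, rowvar=False)
        # Step 2: outlyingness measure computed on the units still kept.
        d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
        # Step 3: delete the most extreme unit.
        worst = keep[int(np.argmax(d2))]
        keep.remove(worst)
        removed.append(worst)
    return removed

# Hypothetical data: 30 clean bivariate observations plus two planted outliers.
rng = np.random.default_rng(2)
sample = np.vstack([rng.normal(size=(30, 2)), [[9.0, -9.0], [10.0, 10.0]]])
removed = backward_search(sample, n_remove=2)
```

Since every distance is computed with the outliers still in the data, a tight cluster of outliers can distort the mean and covariance enough to hide itself, which is the masking effect.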

15 Forward Search
1. Start with an initial outlier-free subset of the data (basic set) of size g.
2. Order the observations in the non-basic set according to their closeness to the basic set.
3. Add the least aberrant observation to the initial subset.
4. Repeat the process until all observations are included.
5. Monitor the search.
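A minimal sketch of steps 1 to 4, under simplifying assumptions that differ from the talk: closeness to the basic set is measured by the Mahalanobis distance instead of factor-model likelihood contributions, the initial subset is chosen by a simple median-based heuristic instead of a search over subsamples, and units only enter the basic set (no interchange). The name `forward_search` and the data are hypothetical.

```python
import numpy as np

def forward_search(y, g):
    """Return the order in which observations enter the basic set."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    # Step 1 (heuristic assumption): the g units closest to the
    # coordinate-wise median, each coordinate scaled by its MAD.
    med = np.median(y, axis=0)
    mad = np.median(np.abs(y - med), axis=0)
    start = np.argsort((((y - med) / mad) ** 2).sum(axis=1))[:g]
    basic = [int(i) for i in start]
    order = list(basic)
    while len(basic) < n:  # step 4: repeat until all units are included
        sub = y[basic]
        diff = y - sub.mean(axis=0)
        cov = np.cov(sub, rowvar=False)
        # Step 2: distance of every unit to the current basic set.
        d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
        d2[basic] = np.inf  # only non-basic units may enter
        nxt = int(np.argmin(d2))  # step 3: least aberrant unit enters
        basic.append(nxt)
        order.append(nxt)
    return order

# Hypothetical data: 40 clean units followed by 5 shifted units (indices 40 to 44).
rng = np.random.default_rng(1)
clean = rng.normal(size=(40, 2))
bad = rng.normal(loc=10.0, scale=0.5, size=(5, 2))
order = forward_search(np.vstack([clean, bad]), g=6)
```

Monitoring (step 5) would track parameter estimates or fit statistics along `order` and look for sharp changes as the late units enter.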

16 Choice of the initial subset
There are usually innumerable subsets of size g: $\binom{n}{g}$.
If there are t outliers, then the probability of selecting a clean initial sample is
$$\frac{\binom{t}{0}\binom{n-t}{g}}{\binom{n}{g}}$$
1. Draw J sub-samples of size g, where $g > p(p+1)/2$.
2. For each subsample, a factor analysis model is fitted and the model parameters, $\theta_k = (\lambda_k, \Psi_k)$, $k = 1, \ldots, J$, are estimated.
3. For each subsample an objective function $F(y, \theta_k)$ is computed. Choices of objective function: median of the likelihood contributions, $\min \mathrm{trace}\big((S - \Sigma(\hat\theta_k))^2\big)$, max(log-likelihood).
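The probability of a clean initial sample is easy to evaluate numerically. The sketch below uses only the standard library; the function names are assumptions, and the second function adds the (assumed) case of J independently drawn subsets. The example values n = 100, t = 20, g = 16 and J = 1000 mirror the contaminated grades example later in the talk.

```python
from math import comb

def prob_clean_subset(n, t, g):
    """P(a random subset of size g contains none of the t outliers)."""
    return comb(t, 0) * comb(n - t, g) / comb(n, g)

def prob_at_least_one_clean(n, t, g, draws):
    """P(at least one of `draws` independent random subsets is clean)."""
    return 1.0 - (1.0 - prob_clean_subset(n, t, g)) ** draws

p_single = prob_clean_subset(100, 20, 16)            # roughly 0.02
p_any = prob_at_least_one_clean(100, 20, 16, 1000)   # very close to 1
```

A single random subset is rarely clean at this contamination level, which is why many subsets are drawn and compared through an objective function.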

17 Progressing in the search
1. Order the observations in the non-basic set using the estimated parameters from the basic set.
2. Statistics used for the ordering: likelihood contributions, residuals, Mahalanobis distances.
3. The next observation to enter the basic set is the one with the largest likelihood contribution, the smallest residual, etc.
4. Stepwise procedure: units might interchange.

18 Monitoring the search
Searching for sharp changes in:
- Parameter estimates
- t-tests
- Goodness-of-fit statistics
- Measures of fit
- Residuals
- Cook-type statistics

19 General remarks
In most of the examples we have looked at, the BS and the FS pick up the same observations as outliers.
BS is less time consuming than the FS.
Do outliers always enter in the last steps?
In the example to follow, none of the contaminated observations was excluded (masking effect) when the BS was used. Analysis of residuals did not identify the contaminated data.

20 Application
Exam grades of 100 students on 5 exams. The first two exams are related to mathematics, the next two to literature and the last one is a comprehensive exam.
Table of estimates: one-factor model ($\hat\lambda$, $\hat\Psi$) and two-factor model ($\hat\lambda_1$, $\hat\lambda_2$, $\hat\Psi$) [values missing].
The asymptotic p-value for the LR-test is [value missing] for the one-factor model and [value missing] for the two-factor model.

21 Contaminating the data set
The first 80 individuals of the grades example, after sorting according to the FS, were selected to represent the outlier-free part of the data.
The responses of 20 individuals are generated from a multivariate normal distribution, $N(\mu, I)$, where $\mu \sim N(\mu_{80}, 15^2)$.
The contaminated observations are labelled in the data set from 1 to 20.

22 Forward Search and results
A FS was conducted using likelihood contributions for adding new observations to the basic set. The initial sample was selected by examining 1000 subsets of size 16 with the ULS criterion. None of the initial samples contained outliers.
The search consists of 84 steps. The 20 contaminants did not enter in the last 20 steps: they started entering at step 51 and the last one entered at step 79.

23 Figure 1: Plot of residuals (axis label: Individuals)

24 Figure 2: Various plots of fit (panels: LRS, WLSC, ULSC, Psi; annotations: swamping effect, masking effect, contaminated units)

25 Modelling Secondary Response Strategies
Binary data
The response pattern consisting of all 1's is denoted by $y_e$. The remaining $2^p - 1$ patterns are denoted by $y_{\bar e}$.
A pseudo item is used to indicate whether an extreme response pattern, $y_e$, is guessed. The pseudo item is denoted by u and it takes the value 1 when a response pattern is guessed and 0 otherwise.
Note that the pseudo item is not observed in the data, since we do not know in advance the number of extreme response patterns that have been guessed.

26 Table 1: Response mechanism

                                      Guessing (u = 1)   No guessing (u = 0)   Total
Extreme response ($y_e$)              $n_{e,g}$          $n_{e,\bar g}$        $n_e$
Non-extreme response ($y_{\bar e}$)   0                  $n_{\bar e,\bar g}$   $n_{\bar e}$
Total                                 $n_g$              $n_{\bar g}$          $n$

Individuals who have given a non-extreme response pattern have not guessed, $P(u = 1 \mid y_{\bar e}) = 0$, and therefore $P(u = 0 \mid y_{\bar e}) = 1$.
For those individuals who guessed, the probability of responding with an extreme response pattern is one, $P(y_e \mid u = 1) = 1$.

27 Modelling the guessing response mechanism
We define the distributions of the responses to the items $(y_1, \ldots, y_p)$ and the unobserved guessing item u conditional on a single latent variable z. Under the assumption of conditional independence:
$$f(y_e, u = 0 \mid z) = \left[ \prod_{i=1}^{p} f_{y_i}(y_i^{e} \mid z) \right] f_u(u = 0 \mid z) \qquad (2)$$
$$f(y_e, u = 1 \mid z) = f_u(u = 1 \mid z) \qquad (3)$$
$$f(y_{\bar e}, u = 0 \mid z) = \left[ \prod_{i=1}^{p} f_{y_i}(y_i^{\bar e} \mid z) \right] f_u(u = 0 \mid z) \qquad (4)$$
$$f(y_{\bar e}, u = 1 \mid z) = 0 \qquad (5)$$

28 The density $f_{y_i}(y_i \mid z)$ is the conditional distribution of each binary item, taken to be the Bernoulli:
$$f_{y_i}(y_i \mid z) = [\pi_i(z)]^{y_i}\, [1 - \pi_i(z)]^{1 - y_i}, \quad i = 1, \ldots, p \qquad (6)$$
where $\pi_i(z) = P(y_i = 1 \mid z)$. The response probability $\pi_i(z)$ for the p observed items is modelled with the two-parameter logistic model:
$$\mathrm{logit}\, \pi_i(z) = \alpha_{0i} + \alpha_{1i} z, \quad i = 1, \ldots, p \qquad (7)$$
where the parameters $\alpha_{0i}$ and $\alpha_{1i}$ are the intercepts and factor loadings respectively.

29 The model for the pseudo item u becomes:
$$\mathrm{logit}\, \pi_u(z) = \alpha_{0,u} + \alpha_{1,u} z + \sum_{j=1}^{s} \beta_{j,u} x_j \qquad (8)$$
where $\beta_{j,u}$ are the regression coefficients of the s covariates $x_j$.

30 Estimation
The complete likelihood for a random sample of size n is:
$$l = \prod_{m=1}^{n} f(y_m, u_m, z_m) \qquad (9)$$
where $f(y_m, u_m, z_m)$ is the joint density of $(y_m, u_m, z_m)$. The only observed part is $y_m = (y_{1m}, \ldots, y_{pm})$; neither the pseudo item $u_m$ nor the latent variable $z_m$ is observed.
For the maximization of the log-likelihood the E-M algorithm is used.

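31 Therefore, the conditional expected values of the score functions are taken over the posterior distribution $f(u, z \mid y)$. For the extreme response patterns $y_e$:
$$E\left\{ \frac{\partial \ln f(y_e, u, z)}{\partial \alpha_{il}} \,\Big|\, y_e \right\} =
\begin{cases}
\int [y_i^{e} - \pi_i(z)]\, z^{l}\, f(u = 0, z \mid y_e)\, dz, & i = 1, \ldots, p \\[4pt]
\int [0 - \pi_u(z)]\, z^{l}\, f(u = 0, z \mid y_e)\, dz + \int [1 - \pi_u(z)]\, z^{l}\, f(u = 1, z \mid y_e)\, dz, & \text{pseudo item } u
\end{cases}$$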

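32 For the non-extreme response patterns $y_{\bar e}$ we have:
$$E\left\{ \frac{\partial \ln f(y_{\bar e}, u, z)}{\partial \alpha_{il}} \,\Big|\, y_{\bar e} \right\} =
\begin{cases}
\int [y_i^{\bar e} - \pi_i(z)]\, z^{l}\, f(u = 0, z \mid y_{\bar e})\, dz, & i = 1, \ldots, p \\[4pt]
\int [0 - \pi_u(z)]\, z^{l}\, f(u = 0, z \mid y_{\bar e})\, dz, & \text{pseudo item } u
\end{cases}$$
The integrals are approximated using Gauss-Hermite quadrature points $z_t$ and corresponding weights $\phi(z_t)$. Alternative approximations can be used (Monte Carlo, Laplace, adaptive quadrature points).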
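To make the quadrature step concrete, the sketch below approximates the marginal probability of a binary response pattern under the two-parameter logistic model, with no guessing component and assuming a standard normal prior for z (the prior is not specified on these slides). It uses NumPy's probabilists' Gauss-Hermite rule; the function name `pattern_probability` and all parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def pattern_probability(y, alpha0, alpha1, n_points=21):
    """Approximate f(y) = integral of prod_i f(y_i | z) phi(z) dz, z ~ N(0, 1)."""
    y = np.asarray(y)
    alpha0 = np.asarray(alpha0, dtype=float)
    alpha1 = np.asarray(alpha1, dtype=float)
    nodes, weights = hermegauss(n_points)     # weight function exp(-z^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)  # rescale so the weights sum to 1
    # pi[t, i] = P(y_i = 1 | z = nodes[t]) under the logistic model
    logits = alpha0[None, :] + alpha1[None, :] * nodes[:, None]
    pi = 1.0 / (1.0 + np.exp(-logits))
    # Bernoulli likelihood of the whole pattern at each quadrature point
    cond = np.prod(np.where(y == 1, pi, 1.0 - pi), axis=1)
    return float(np.sum(weights * cond))

# Hypothetical intercepts and loadings for p = 6 items.
a0 = np.array([0.5, -1.0, 0.2, -0.3, 0.8, -1.5])
a1 = np.array([1.0, 2.0, 0.5, 1.5, 0.7, 1.2])
p_extreme = pattern_probability([1, 1, 1, 1, 1, 1], a0, a1)
```

Multiplying `p_extreme` by the sample size gives the expected frequency of the all-ones pattern under the fitted model, the quantity that is compared with the observed frequency.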

33 Guessing response process is completely at random - a simple case
When the odds of guessing do not depend on a latent variable or on covariates, the terms $f_u(u = 1 \mid z)$ and $f_u(u = 0 \mid z)$ become
$$f_u(u = 1 \mid z) = \eta \quad \text{and} \quad f_u(u = 0 \mid z) = 1 - \eta,$$
where $\eta$ is the proportion of guessers out of the total sample size; it does not depend on the latent variable z. No model is fitted to the guessing/pseudo item.

34 Taking into account that in any given data set there is a fixed number of observed responses of type $y_e$, the model estimation and the updating of the number of guessed response patterns are done as follows:
1. Select the number of guessed extreme responses, say $n_g(y_e)$; the proportion of guessed responses is then defined as $\eta = n_g(y_e)/n$.
2. Fit the one-factor model to the remaining $n - n_g(y_e)$ observations, where n is the total sample size.

35 3. Update the proportion of guessed extreme responses by computing the conditional probability of guessing given an extreme response pattern:
$$P(u = 1 \mid y_e) = \frac{\eta}{\eta + f(y_e \mid u = 0)\,(1 - \eta)} \qquad (10)$$
where $f(y_e \mid u = 0)$ is a conditional probability computed from the model.
4. The allocation of extreme response patterns $y_e$ to guessers is:
$$n_g^{new}(y_e) = \frac{n_g^{old}(y_e)}{n_g^{old}(y_e) + f(y_e)\,\big(n - n_g^{old}(y_e)\big)}\; n_e$$
where $n_e$ is the total number of extreme response patterns. No guessers are allocated to the non-extreme group.
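Steps 3 and 4 amount to a few lines of arithmetic. The sketch below assumes that $f(y_e)$ in step 4 is the same model probability $f(y_e \mid u = 0)$ that appears in equation (10); under that reading, step 4 is equation (10) with $\eta = n_g^{old}(y_e)/n$, multiplied by $n_e$. The function name and the value 0.005 are hypothetical. In the actual procedure the model is refitted after every update, so the probability changes between iterations; it is held fixed here only to keep the example short.

```python
def update_guessers(n_g_old, n, n_e, f_extreme):
    """One pass of steps 3 and 4: returns (new number of guessers, P(u = 1 | y_e))."""
    eta = n_g_old / n  # current proportion of guessed responses
    # Equation (10): posterior probability of guessing given an extreme pattern.
    post = eta / (eta + f_extreme * (1.0 - eta))
    # Step 4: share of the n_e extreme patterns allocated to guessers.
    return post * n_e, post

# n = 1005 and n_e = 33 follow the modified WIRS data (3 original + 30 changed
# patterns); the starting value 10 and f_extreme = 0.005 are illustrative only.
n_g = 10.0
for _ in range(200):
    n_g, posterior = update_guessers(n_g, n=1005, n_e=33, f_extreme=0.005)
```

With a fixed model probability the iteration settles at the value of $n_g$ for which the guessers plus the model-expected extreme patterns among non-guessers add up to $n_e$.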

36 Model interpretation
What do we expect when we adjust for guessed extreme patterns?
- Improve the fit (study goodness-of-fit measures).
- Interpret the factor loadings taking into account the existence of outliers.
- Relate the guessing mechanism to covariates and identify demographic groups that are more inclined to guess than other groups.

37 Workplace Industrial Relations Survey
Please consider the most recent change involving the introduction of new plant, machinery and equipment. Were discussions or consultations of any of the type on this card held either about the introduction of the change or about the way it was to be implemented?
1. Informal discussion with individual workers.
2. Meetings with groups of workers.
3. Discussions in established joint consultative committee.
4. Discussions in specially constituted committee to consider the change.
5. Discussions with union representatives at the establishment.
6. Discussions with paid union officials from outside.

38 n = 1005 non-manual workers. The one-factor model gives $G^2$ = [value missing] and $X^2$ = [value missing] on 32 degrees of freedom. The fit on the univariate, bivariate and trivariate margins suggests that the bad fit is due to item 1.

39 Table 2: Chi-squared residuals greater than 3 for the second and (1,1,1) third order margins for the one-factor model, WIRS data
Columns: Response, Items, O, E, O - E, $(O - E)^2/E$
Rows are grouped by response: (0,0), (0,1), (1,0), (1,1) for pairs of items and (1,1,1) for triplets of items [cell values missing].

40 In the original data set (n = 1005) there were only three extreme response patterns (111111), with expected frequency [value missing] under the one-factor model. There were also thirty response patterns of type (011111), with expected frequency [value missing]. For the purpose of our analysis, we took the response patterns (011111), thirty in total, and changed them to (111111).

41 Maximum likelihood standardized loadings
Columns: Item; No-guessing ($\hat\alpha_i$); Simple Guess ($\hat\alpha_i$); Model Guess ($\hat\alpha_i$); final row LogL [cell values missing].

42 Observed and expected frequencies for extreme response patterns Response Observed Expected pattern frequency frequency One-factor model One factor with 1 26 guessing One-factor model with guessing model

43 Conclusions
- The model itself adjusts for guessed extreme response patterns.
- Extensions to other types of extreme responses.
- Link and compare this method with other methods available, such as robust estimation and subset regression methods.
- The FS algorithm has been extended to binary and mixed type data.
