Model Estimation Example


Ronald H. Heck
EDEP 606: Multivariate Methods (S2013)
April 7, 2013

As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions of various model estimation methods come up regularly in factor analysis, structural equation models, mixed (or multilevel) models, and generalized linear models (i.e., models for dichotomous, ordinal, multinomial, and count outcomes). Model estimation attempts to determine the extent to which a model-implied covariance (or correlation) matrix is a good approximation of the sample covariance matrix. In general, confirmation of a proposed model relies on the retention of the null hypothesis, that is, that the data are consistent with the model hypothesized (Marcoulides & Hershberger, 1997). Failure to reject this null hypothesis implies that the proposed model is a plausible representation of the data, although it is important to note that it may not be the only plausible representation of the data.

As Marcoulides and Hershberger (1997) note, evaluating the difference between the two covariance matrices based on the proposed model depends on the estimation method used to solve for the model's parameters [e.g., generalized least squares (GLS), maximum likelihood (ML), weighted least squares (WLS)]. Each approach proceeds iteratively to solve the model-implied equations until an optimal solution for the model parameters is obtained (i.e., where the implied covariance matrix is close to the observed covariance matrix). The difference between the model-implied matrix and the sample matrix is described by a discrepancy function, that is, a way of weighting the differences between the observed (S) and model-implied (Ŝ) covariance matrices. In matrix terms, we can define this as

F = (s − ŝ)′W(s − ŝ),   (1)

where s and ŝ are the nonduplicated elements of the observed and implied covariance matrices S and Ŝ, arranged as vectors.
The goal of the analysis is to minimize this function by taking partial derivatives of it with respect to the model parameters that determine the elements of the implied covariance matrix Ŝ. So, for example, if we have a 3 x 3 covariance matrix, the lower part of the matrix would become a six-element vector (3 variances and 3 covariances), and (s − ŝ) would contain the differences between the elements in the two covariance matrices (Loehlin, 1992). The exact form of the discrepancy function is different for each estimation method, and each can have its own set of advantages and disadvantages. In Eq. 1 above, W is a weight matrix, and different versions of it (i.e., ML, GLS, WLS) will yield different criteria for weighting the differences between the corresponding elements in the observed and implied covariance matrices. If W in Eq. 1 is an identity (I) matrix (which has 1s as the diagonal elements and 0s as the off-diagonal elements), the expression reduces to (s − ŝ)′(s − ŝ). This is just the sum of the squared differences between the elements of the observed and implied covariance

matrices, which happens to be the ordinary least squares (OLS) criterion. Unweighted least squares (ULS) estimation is the same as OLS, in that the weight matrix is also just an identity matrix. Loehlin describes this in terms of squaring because the expression above amounts to the product of the two deviation vectors. If the two matrices are identical, the value of the expression will be 0. The greater the difference between the two matrices, the larger the squared differences in their elements will be. The sum of these is the discrepancy function (F). The larger the discrepancy function becomes, the worse the fit, which implies less similarity between the elements in the two matrices. Model estimation involves trying to minimize F by seeking values of the unknown model parameters that make the implied covariance matrix as much like the observed covariance matrix as possible (Loehlin, 1992). OLS is most appropriate, for example, when the variables in the covariance matrix are measured on the same type of scale.

In comparison to OLS (or ULS) estimation, GLS, ML, and WLS require considerably more computation. As Loehlin (1992) notes, for variables that are normally distributed (or relatively so), Eq. 1 reduces to the following:

F = 1/2 tr{[(S − Ŝ)V]²},   (2)

where tr is the trace of the matrix (i.e., the sum of the diagonal elements) and V is another weight matrix. This formulation helps clarify the differences between ULS, GLS, and ML estimation. As noted, for ULS the weight matrix V is the identity (V = I). For GLS, it is the inverse of the sample covariance matrix (V = S⁻¹), and for ML it is defined as the inverse of the model-implied covariance matrix (V = Ŝ⁻¹).
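As a concrete sketch of Eqs. 1 and 2 (a minimal Python illustration with invented matrices, not code from any of the cited sources):

```python
import numpy as np

def vech(m):
    """Stack the nonduplicated (lower-triangular) elements of a
    symmetric matrix into a vector, e.g., 3 variances plus 3
    covariances for a 3 x 3 covariance matrix."""
    return m[np.tril_indices(m.shape[0])]

def discrepancy(S, S_hat, W=None):
    """Eq. 1: F = (s - s_hat)' W (s - s_hat). With W = I this is the
    OLS/ULS criterion: the sum of squared element differences."""
    d = vech(S) - vech(S_hat)
    if W is None:
        W = np.eye(d.size)  # identity weight matrix
    return d @ W @ d

def trace_discrepancy(S, S_hat, V):
    """Eq. 2: F = 1/2 * tr{[(S - S_hat) V]^2}. The choice of V
    distinguishes ULS (V = I), GLS (V = inv(S)), ML (V = inv(S_hat))."""
    D = (S - S_hat) @ V
    return 0.5 * np.trace(D @ D)

# Hypothetical observed and model-implied 3 x 3 covariance matrices
S = np.array([[4.0, 1.2, 0.8],
              [1.2, 3.0, 0.5],
              [0.8, 0.5, 2.5]])
S_hat = np.array([[4.0, 1.0, 0.9],
                  [1.0, 3.0, 0.5],
                  [0.9, 0.5, 2.5]])

print(discrepancy(S, S))         # identical matrices -> 0
print(discrepancy(S, S_hat))     # OLS sum of squared differences
print(trace_discrepancy(S, S_hat, np.eye(3)))             # ULS
print(trace_discrepancy(S, S_hat, np.linalg.inv(S)))      # GLS
print(trace_discrepancy(S, S_hat, np.linalg.inv(S_hat)))  # ML
```

Each criterion is 0 when the two matrices coincide; they differ only in how the element-by-element discrepancies are weighted.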
Because the ML discrepancy function uses the inverse of the model-implied covariance matrix Ŝ, which has to be recalculated at each iteration, ML estimation is more challenging under certain conditions. It should be noted that ML is typically defined somewhat differently from Eq. 2 (Loehlin, 1992):

F_ML = tr(SŜ⁻¹) − p + ln|Ŝ| − ln|S|,   (3)

that is, the discrepancy function is defined in terms of the trace (i.e., the sum of the diagonal elements) of the product of the sample covariance matrix and the inverse of the model-implied covariance matrix, and the natural logarithms of the determinants of the model-implied and sample covariance matrices, given the number of variables (p) in the matrix. This leads to similar minimizing of the discrepancy function (Loehlin, 1992); however, it is often advantageous to work with logarithms, which can make solving the discrepancy function easier. Note that in each of these cases, we are assuming that only the covariance matrices are being estimated (modeling mean structures simply requires additional terms added to each discrepancy function).

As this discussion suggests, each general approach to model estimation rests on a somewhat different set of assumptions and statistical theory underlying the estimation of various kinds of models. Since GLS uses the inverse of the sample covariance matrix S as the weight matrix, an advantage is that the weight matrix only needs to be calculated once, since S does not change (GLS has therefore been described as ML estimation with a single iteration). As noted, ML depends on the model-implied covariance matrix and therefore typically requires more complex calculations (with multivariate normality and large sample sizes, however, GLS and

ML will produce very similar estimates). In cases where the outcomes are categorical (e.g., dichotomous, ordinal), estimation is considerably more complex than in OLS regression models for continuous outcomes, since it depends on estimating probability relationships that follow sampling distributions other than the normal distribution. Such models (referred to as generalized linear models) therefore require iterative techniques such as ML to solve the implied set of relationships.

GLS and ML can be used to derive a chi-square fit index through the calculation of

χ² = (N − 1)F_min,   (4)

where F_min is the value of the discrepancy function at the point of best fit and N is the sample size. As you have likely encountered, however, this model fit index is not always favored because of its reliance on sample size, which can lead to rejecting relatively good-fitting models in larger samples. Empirical work suggests ML estimation will work reasonably well with skewness within +/-2 and kurtosis within +/-7 (West et al., 1995). WLS can also provide a chi-square fit index, and it does not depend on multivariate normality (i.e., it is often used with ordinal types of outcomes in SEM). However, WLS is based on the variances and covariances among the vector elements (s) of the observed covariance matrix S. So as the original covariance matrix S gets larger, the vector s of its nonduplicated elements increases rapidly in length, and the weight matrix, whose size is the square of the length of that vector, can become quite large and demanding in terms of the calculation of model parameters. Therefore, WLS typically requires much larger sample sizes than ML and GLS estimation. For ML and GLS, model convergence problems certainly increase in samples of 100 or fewer (and with fewer than 3 indicators per factor in factor models). Heywood cases in factor models are also very likely to occur under those sorts of conditions.
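Tying Eqs. 3 and 4 together, here is a minimal numerical sketch (the matrices and sample size are hypothetical, chosen only to illustrate the calculations):

```python
import numpy as np

def f_ml(S, S_hat):
    """Eq. 3: F_ML = tr(S S_hat^-1) - p + ln|S_hat| - ln|S|.
    Equals 0 when the model-implied matrix reproduces the observed
    matrix exactly."""
    p = S.shape[0]
    return (np.trace(S @ np.linalg.inv(S_hat)) - p
            + np.log(np.linalg.det(S_hat)) - np.log(np.linalg.det(S)))

# Hypothetical observed and model-implied covariance matrices
S = np.array([[4.0, 1.2],
              [1.2, 3.0]])
S_hat = np.array([[4.0, 1.0],
                  [1.0, 3.0]])

F_min = f_ml(S, S_hat)        # discrepancy at the (assumed) best fit
N = 500                       # hypothetical sample size
chi_square = (N - 1) * F_min  # Eq. 4: model chi-square

print(f_ml(S, S))   # ~0 for a perfectly fitting model
print(chi_square)
```

Note how the same F_min yields a larger chi-square as N grows, which is exactly the sample-size sensitivity described above.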
Where one has enough cases and at least 3 indicators per factor, convergence becomes less of a problem.

ML Estimation

ML estimation is probably the approach most often used to estimate various types of models with interval and categorical outcomes, but it does depend on relatively large sample sizes (we can use restricted maximum likelihood in small samples). ML estimation determines the optimal population values for the parameters in a model, that is, the values that reduce the discrepancy between the observed and implied matrices, given the current parameter estimates (Hox, 2010). As noted, in ML estimation the discrepancy function is defined in terms of a likelihood function (or likelihood) that the model with a particular set of estimates could have produced the observed covariance matrix. In many cases (since the functions involved may be exponential in nature), it is more convenient to work in terms of the natural logarithm of the likelihood function, called the log-likelihood. One advantage of the log-likelihood is that its terms are additive (instead of multiplicative). Because the likelihood of the data can vary from 0.0 to 1.0, rather than maximizing the likelihood function directly, ML uses a more conceptually convenient function that is inversely related to the likelihood function (the discrepancy function described previously), such that the smaller this discrepancy function is, the greater the likelihood that the model with a particular set of parameter estimates could have produced the sample covariance matrix (S). The value will be 0 if the model fits the data perfectly (i.e., the natural log of 1 = 0).
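The product-versus-sum point can be made concrete with a small sketch (the probabilities below are invented purely for illustration):

```python
import math

# Hypothetical predicted probabilities a model assigns to five
# observed responses
probs = [0.8, 0.6, 0.9, 0.7, 0.5]

# The likelihood of the sample is the product of the individual
# probabilities: a number between 0 and 1 that shrinks quickly
likelihood = math.prod(probs)

# The log-likelihood replaces the product with a sum of logs,
# which is numerically far more convenient
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # a small positive number
print(log_likelihood)  # negative, since it is the log of a value in (0, 1)
```

The two quantities carry the same information: the log of the product equals the sum of the logs.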

Note also that the log-likelihood function is negative because the logarithm of a number between 0 and 1 is negative (e.g., the natural log of 0.2 is -1.61). Estimating the parameters involves making a series of iterative guesses to determine an optimal set of values for the model's parameters, that is, values that minimize -1 times the natural logarithm of the likelihood of the data. Arriving at a set of final estimates is known as model convergence (i.e., the point where the estimates no longer change and the likelihood is therefore at its maximum value). It is important that the model actually reaches convergence, as the resulting parameter estimates will not be trustworthy if it has not. Sometimes increasing the number of iterations will result in a model that converges, but often the failure of the model to converge on a unique solution is an indication that it needs to be changed and re-estimated. Keep in mind that even if a model converges, that does not mean the estimates are the right ones, given the sample data. In the same way, we would not conclude that because we fail to reject a model as consistent with the observed data, it is the only model that would fit this criterion.

For models with categorical outcomes, the likelihood function is a little different from that for models with continuous outcomes (owing to their different sampling distributions), but the principle of model estimation is the same. In this latter case, ML estimation often employs Fisher scoring, which uses a likelihood function that captures the probability of the observed data over a range of parameter values. For Poisson or binomial distributions this algorithm simplifies to the Newton-Raphson procedure (Azen & Walker, 2011).
Both algorithms proceed by making an initial guess for all the model parameters and then adjusting that guess to increase the likelihood function. This is repeated until the estimates no longer change and the iteration process has converged on the values of the final ML estimates (Azen & Walker, 2011). ML estimation produces a model deviance statistic (often referred to as -2LL, or -2 times the log likelihood), which is an indicator of how well the model fits the data. We multiply the log likelihood by -2 so it can be expressed easily as a positive number. Models with lower deviance (i.e., a smaller discrepancy function) fit the data better than models with larger deviance. Once we have a solution that converges, we can assess how well the proposed model fits the data using various model fit indices. We can also look at the residuals (or residual matrix), which describe the difference between the model-implied covariance matrix and the actual covariance matrix. Large residuals imply that some aspects of the proposed model do not fit the data well.

An Example Using an Ordinal Outcome

Let's say we wish to estimate a model where the outcome is ordinal and there are two predictors (score on a math test and gender). We will use GENLIN in IBM SPSS, since we can easily print relevant information about the model estimation procedures.

Model 1: Threshold-Only Model (no predictors)

We first estimate a baseline model with no predictors. Below (Table 1) we have information about the type of model: the probability distribution is multinomial, which is appropriate for ordinal outcomes, and the link function (because the outcome is not continuous) is the cumulative logit.
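The iterate-until-converged scheme described above can be sketched for a simple binary logistic model (a minimal illustration with simulated data; this is not the GENLIN ordinal procedure itself, and all names and values are invented):

```python
import numpy as np

def logistic_newton(X, y, tol=1e-8, max_iter=25):
    """Newton-Raphson for binary logistic regression: start from an
    initial guess (all zeros), then repeatedly update the coefficients
    until the log likelihood stops changing (convergence)."""
    beta = np.zeros(X.shape[1])
    prev_ll = -np.inf
    ll = prev_ll
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # predicted probabilities
        ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        if abs(ll - prev_ll) < tol:               # convergence criterion met
            break
        prev_ll = ll
        W = p * (1 - p)                           # weights for the Hessian
        H = X.T @ (X * W[:, None])                # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta, ll

# Small simulated data set: intercept plus one predictor
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.random(200) < true_p).astype(float)

beta, ll = logistic_newton(X, y)
print(beta)     # estimated intercept and slope
print(-2 * ll)  # the model deviance (-2LL)
```

Each pass is one "iteration" of the sort reported in an iteration history table; the loop stops when the log likelihood changes by less than the tolerance.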

Table 1. Model Information
Dependent Variable: courses (a)
Probability Distribution: Multinomial
Link Function: Cumulative logit
a. The procedure applies the cumulative link function to the dependent variable values in ascending order.

Below (Table 2) we have the distribution of perceptions about taking additional math courses past Algebra I: 45% of students perceived they would not take any further math classes beyond Algebra I, about 38.5% perceived they would take one additional course, and another 15.3% perceived they would take two additional courses. We can also see that only about 1.3% perceived they would take 3-4 additional courses beyond Algebra I.

Table 2. Categorical Variable Information (N and percent for each category of the dependent variable, courses)

We first estimate a model with just the thresholds (i.e., the intercepts). We can see in Table 3 that at the first iteration we have the initial log likelihood estimate. Because the likelihood, or probability, of the data can vary from 0.0 to 1.0, it is common to take the log of it. The log likelihood in the table is interpreted as the negative natural log of the likelihood function. The log of 1 is 0 (which would indicate no discrepancy), so a large initial log likelihood corresponds to a likelihood function that is quite small (just above 0), which suggests that the current model does not fit the data very well. As we add variables, the log likelihood is reduced (i.e., moves closer to 0), which amounts to reducing the discrepancy function (or maximizing the likelihood that the proposed model accounted for the observed data).

Table 3. Iteration History (update type, number of step-halvings, log likelihood, and the threshold parameters [courses=0] through [courses=3] and scale at each iteration)
a. The kernel of the log likelihood function is displayed.
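To illustrate how the cumulative logit link turns thresholds into category probabilities, here is a sketch (the threshold values are hypothetical, chosen only to roughly reproduce the percentages described for Table 2):

```python
import math

def cumulative_probs(thresholds, eta=0.0):
    """Cumulative logit model: P(Y <= j) = 1 / (1 + exp(-(tau_j - eta))).
    With no predictors (eta = 0), the thresholds alone determine the
    cumulative proportions of the outcome categories."""
    return [1.0 / (1.0 + math.exp(-(t - eta))) for t in thresholds]

def category_probs(thresholds, eta=0.0):
    """Differences between adjacent cumulative probabilities give the
    probability of each individual category."""
    cum = cumulative_probs(thresholds, eta) + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical thresholds chosen to mimic cumulative proportions of
# about 0.45, 0.835, and 0.987 across the four categories
taus = [math.log(0.45 / 0.55),
        math.log(0.835 / 0.165),
        math.log(0.987 / 0.013)]

print([round(p, 3) for p in category_probs(taus)])
```

Adding predictors shifts eta for each case, moving probability mass across the categories while the thresholds stay fixed.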

Below we can examine various fit criteria.

Table 4. Goodness of Fit (value, df, and value/df for the deviance, scaled deviance, Pearson chi-square, and scaled Pearson chi-square, along with the log likelihood, Akaike's Information Criterion (AIC), the finite-sample corrected AIC (AICC), the Bayesian Information Criterion (BIC), and the consistent AIC (CAIC))
a. The kernel of the log likelihood function is displayed and used in computing the information criteria.

Here are the thresholds between the various categories of the outcome variable.

Table 5. Parameter Estimates (B, standard error, and Wald chi-square hypothesis test with df and significance for the thresholds [courses=0] through [courses=3]; the scale parameter is fixed at the displayed value)

Model 2: Adding Two Predictors (test score and gender)

Of greater interest is what happens when we add predictors to the model. Our assumption is that adding gender and previous test performance will reduce the size of the log likelihood function.

Table 6. Continuous Variable Information (N, minimum, maximum, mean, and standard deviation for the covariates test and female)

Below (Table 7) we can see the iteration history for estimating the model with two predictors. We have the initial estimate of the log likelihood (which is for the model with no predictors). Then the model begins to iterate (using maximum likelihood) to solve the equations in a way that maximizes the likelihood of the estimated effects of each predictor on the outcome. You can see that it takes several trials, or iterations, to reach an optimal solution for the population estimates from the sample data. You can also see that at each iteration the estimates of the test score effect and the female (or gender) effect change a little, until the convergence criteria are satisfied.

Table 7. Iteration History (update type, number of step-halvings, log likelihood, thresholds [courses=0] through [courses=3], test1, female, and scale at each iteration; an initial scoring step is followed by several Newton steps)
Model: (Threshold), test1, female
a. All convergence criteria are satisfied.
b. The kernel of the log likelihood function is displayed.
Redundant parameters are not displayed. Their values are always zero in all iterations.

Next, in Table 8 we see a summary of the various fit indices for the model.

Table 8. Goodness of Fit (value, df, and value/df for the deviance, scaled deviance, Pearson chi-square, and scaled Pearson chi-square, along with the log likelihood, AIC, AICC, BIC, and CAIC)
Model: (Threshold), test1, female

a. The kernel of the log likelihood function is displayed and used in computing the information criteria.

From Table 8 we can see that the log likelihood has been reduced considerably in this model. Some of the other model fitting information may be familiar to you (e.g., AIC and BIC). AIC and BIC are estimated from the log likelihood (with additional penalty terms). For example, for the AIC index (where k is the number of parameters in the model):

AIC = 2k + (-2LL) = 2(6) + (-2LL for the fitted model).

The likelihood ratio chi-square, which is calculated directly from the change in the log likelihoods between the initial (no predictors) model and the second model (with 2 predictors), can be used to construct a test of whether Model 2 fits the data better than Model 1 (the baseline model).

Table 9. Omnibus Test (likelihood ratio chi-square, df, and significance)
a. Compares the fitted model against the thresholds-only model.

We can see that the chi-square is significant with 2 degrees of freedom (for the two added predictors). Here is how we calculate the coefficient from the change in log likelihoods: take the difference between the initial log likelihood and the log likelihood for the model with 2 predictors, and multiply it by 2; the result is the likelihood ratio chi-square on 2 df.
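These two calculations can be sketched as follows (the log likelihood values here are invented placeholders, not the handout's actual values):

```python
import math

def aic(log_lik, k):
    """AIC = 2k + (-2LL): the -2 log likelihood penalized by twice
    the number of estimated parameters k."""
    return 2 * k + (-2 * log_lik)

def lr_chi_square(ll_baseline, ll_full):
    """Likelihood ratio chi-square: twice the improvement in log
    likelihood from the baseline to the fuller model. Its df equals
    the number of added predictors."""
    return 2 * (ll_full - ll_baseline)

# Hypothetical log likelihoods standing in for the handout's values
ll_model1 = -9244.3  # thresholds-only baseline (Model 1)
ll_model2 = -9168.7  # after adding test score and gender (Model 2)

print(aic(ll_model2, k=6))                  # k = 6 parameters, as in the text
print(lr_chi_square(ll_model1, ll_model2))  # compare to chi-square with 2 df
```

A likelihood ratio chi-square well beyond the critical value for 2 df would lead us to prefer Model 2, matching the omnibus test logic above.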

Finally, we can see the summary of the parameters in the model. We can see that the earlier test score (I think it is an 8th-grade test) is a significant predictor of students' perceptions of math course taking beyond Algebra I, while gender is not.

Table 10. Parameter Estimates (B, standard error, and Wald chi-square hypothesis test with df and significance for the thresholds [courses=0] through [courses=3], test, and female; the scale parameter is fixed at the displayed value)
Model: (Threshold), test1, female

We could add further variables and see whether we could reduce the log likelihood further, but we will stop here for now. This should provide some sense of how model estimation proceeds and how the criteria used to estimate the model result in a set of parameters and model fit criteria that can be used to evaluate how well the proposed model compares against the actual sample covariance matrix.

References

Azen, R., & Walker, C. (2011). Categorical data analysis for the behavioral and social sciences. New York: Routledge.

Hox, J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York: Routledge.

Loehlin, J. C. (1992). Latent variable models: An introduction to factor, path, and structural analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Marcoulides, G., & Hershberger, S. (1997). Multivariate statistical methods: A short course. Mahwah, NJ: Lawrence Erlbaum.

West, S., Finch, J., & Curran, P. (1995). Structural equation models with nonnormal variables: Problems and remedies. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage.


More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

13.1 Categorical Data and the Multinomial Experiment

13.1 Categorical Data and the Multinomial Experiment Chapter 13 Categorical Data Analysis 13.1 Categorical Data and the Multinomial Experiment Recall Variable: (numerical) variable (i.e. # of students, temperature, height,). (non-numerical, categorical)

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Step 2: Select Analyze, Mixed Models, and Linear.

Step 2: Select Analyze, Mixed Models, and Linear. Example 1a. 20 employees were given a mood questionnaire on Monday, Wednesday and again on Friday. The data will be first be analyzed using a Covariance Pattern model. Step 1: Copy Example1.sav data file

More information

Generalized linear models

Generalized linear models Generalized linear models Outline for today What is a generalized linear model Linear predictors and link functions Example: estimate a proportion Analysis of deviance Example: fit dose- response data

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Models for Binary Outcomes

Models for Binary Outcomes Models for Binary Outcomes Introduction The simple or binary response (for example, success or failure) analysis models the relationship between a binary response variable and one or more explanatory variables.

More information

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION Categorical Data Analysis (Semester II: 2010 2011) April/May, 2011 Time Allowed : 2 Hours Matriculation No: Seat No: Grade Table Question 1 2 3 4 5 6 Full marks

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

Psychology 282 Lecture #4 Outline Inferences in SLR

Psychology 282 Lecture #4 Outline Inferences in SLR Psychology 282 Lecture #4 Outline Inferences in SLR Assumptions To this point we have not had to make any distributional assumptions. Principle of least squares requires no assumptions. Can use correlations

More information

Logistic Regression. Continued Psy 524 Ainsworth

Logistic Regression. Continued Psy 524 Ainsworth Logistic Regression Continued Psy 524 Ainsworth Equations Regression Equation Y e = 1 + A+ B X + B X + B X 1 1 2 2 3 3 i A+ B X + B X + B X e 1 1 2 2 3 3 Equations The linear part of the logistic regression

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Frequency Distribution Cross-Tabulation

Frequency Distribution Cross-Tabulation Frequency Distribution Cross-Tabulation 1) Overview 2) Frequency Distribution 3) Statistics Associated with Frequency Distribution i. Measures of Location ii. Measures of Variability iii. Measures of Shape

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study

Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study 1.4 0.0-6 7 8 9 10 11 12 13 14 15 16 17 18 19 age Model 1: A simple broken stick model with knot at 14 fit with

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Single-level Models for Binary Responses

Single-level Models for Binary Responses Single-level Models for Binary Responses Distribution of Binary Data y i response for individual i (i = 1,..., n), coded 0 or 1 Denote by r the number in the sample with y = 1 Mean and variance E(y) =

More information

Statistics 3858 : Contingency Tables

Statistics 3858 : Contingency Tables Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 1: August 22, 2012

More information

More Accurately Analyze Complex Relationships

More Accurately Analyze Complex Relationships SPSS Advanced Statistics 17.0 Specifications More Accurately Analyze Complex Relationships Make your analysis more accurate and reach more dependable conclusions with statistics designed to fit the inherent

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Random Intercept Models

Random Intercept Models Random Intercept Models Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline A very simple case of a random intercept

More information

Longitudinal Modeling with Logistic Regression

Longitudinal Modeling with Logistic Regression Newsom 1 Longitudinal Modeling with Logistic Regression Longitudinal designs involve repeated measurements of the same individuals over time There are two general classes of analyses that correspond to

More information

Can you tell the relationship between students SAT scores and their college grades?

Can you tell the relationship between students SAT scores and their college grades? Correlation One Challenge Can you tell the relationship between students SAT scores and their college grades? A: The higher SAT scores are, the better GPA may be. B: The higher SAT scores are, the lower

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Model fit evaluation in multilevel structural equation models

Model fit evaluation in multilevel structural equation models Model fit evaluation in multilevel structural equation models Ehri Ryu Journal Name: Frontiers in Psychology ISSN: 1664-1078 Article type: Review Article Received on: 0 Sep 013 Accepted on: 1 Jan 014 Provisional

More information

Introduction to Within-Person Analysis and RM ANOVA

Introduction to Within-Person Analysis and RM ANOVA Introduction to Within-Person Analysis and RM ANOVA Today s Class: From between-person to within-person ANOVAs for longitudinal data Variance model comparisons using 2 LL CLP 944: Lecture 3 1 The Two Sides

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression Logistic Regression Usual linear regression (repetition) y i = b 0 + b 1 x 1i + b 2 x 2i + e i, e i N(0,σ 2 ) or: y i N(b 0 + b 1 x 1i + b 2 x 2i,σ 2 ) Example (DGA, p. 336): E(PEmax) = 47.355 + 1.024

More information

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013 Analysis of Categorical Data Nick Jackson University of Southern California Department of Psychology 10/11/2013 1 Overview Data Types Contingency Tables Logit Models Binomial Ordinal Nominal 2 Things not

More information

Generalized linear models

Generalized linear models Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches Sta 216, Lecture 4 Last Time: Logistic regression example, existence/uniqueness of MLEs Today s Class: 1. Hypothesis testing through analysis of deviance 2. Standard errors & confidence intervals 3. Model

More information

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction

More information

11. Generalized Linear Models: An Introduction

11. Generalized Linear Models: An Introduction Sociology 740 John Fox Lecture Notes 11. Generalized Linear Models: An Introduction Copyright 2014 by John Fox Generalized Linear Models: An Introduction 1 1. Introduction I A synthesis due to Nelder and

More information

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game. EdPsych/Psych/Soc 589 C.J. Anderson Homework 5: Answer Key 1. Probelm 3.18 (page 96 of Agresti). (a) Y assume Poisson random variable. Plausible Model: E(y) = µt. The expected number of arrests arrests

More information

CHAPTER 1: BINARY LOGIT MODEL

CHAPTER 1: BINARY LOGIT MODEL CHAPTER 1: BINARY LOGIT MODEL Prof. Alan Wan 1 / 44 Table of contents 1. Introduction 1.1 Dichotomous dependent variables 1.2 Problems with OLS 3.3.1 SAS codes and basic outputs 3.3.2 Wald test for individual

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

A Threshold-Free Approach to the Study of the Structure of Binary Data

A Threshold-Free Approach to the Study of the Structure of Binary Data International Journal of Statistics and Probability; Vol. 2, No. 2; 2013 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education A Threshold-Free Approach to the Study of

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

Survival Analysis I (CHL5209H)

Survival Analysis I (CHL5209H) Survival Analysis Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca January 7, 2015 31-1 Literature Clayton D & Hills M (1993): Statistical Models in Epidemiology. Not really

More information

General structural model Part 2: Categorical variables and beyond. Psychology 588: Covariance structure and factor models

General structural model Part 2: Categorical variables and beyond. Psychology 588: Covariance structure and factor models General structural model Part 2: Categorical variables and beyond Psychology 588: Covariance structure and factor models Categorical variables 2 Conventional (linear) SEM assumes continuous observed variables

More information

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: )

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: ) NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) Categorical Data Analysis (Semester II: 2010 2011) April/May, 2011 Time Allowed : 2 Hours Matriculation No: Seat No: Grade Table Question 1 2 3

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Eplained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION

ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION Ernest S. Shtatland, Ken Kleinman, Emily M. Cain Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA ABSTRACT In logistic regression,

More information

1. BINARY LOGISTIC REGRESSION

1. BINARY LOGISTIC REGRESSION 1. BINARY LOGISTIC REGRESSION The Model We are modelling two-valued variable Y. Model s scheme Variable Y is the dependent variable, X, Z, W are independent variables (regressors). Typically Y values are

More information

A Re-Introduction to General Linear Models

A Re-Introduction to General Linear Models A Re-Introduction to General Linear Models Today s Class: Big picture overview Why we are using restricted maximum likelihood within MIXED instead of least squares within GLM Linear model interpretation

More information

Advanced Quantitative Data Analysis

Advanced Quantitative Data Analysis Chapter 24 Advanced Quantitative Data Analysis Daniel Muijs Doing Regression Analysis in SPSS When we want to do regression analysis in SPSS, we have to go through the following steps: 1 As usual, we choose

More information

Using the same data as before, here is part of the output we get in Stata when we do a logistic regression of Grade on Gpa, Tuce and Psi.

Using the same data as before, here is part of the output we get in Stata when we do a logistic regression of Grade on Gpa, Tuce and Psi. Logistic Regression, Part III: Hypothesis Testing, Comparisons to OLS Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 14, 2018 This handout steals heavily

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46 BIO5312 Biostatistics Lecture 10:Regression and Correlation Methods Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/1/2016 1/46 Outline In this lecture, we will discuss topics

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017 MLMED User Guide Nicholas J. Rockwood The Ohio State University rockwood.19@osu.edu Beta Version May, 2017 MLmed is a computational macro for SPSS that simplifies the fitting of multilevel mediation and

More information