/4/04
Structural Equation Modeling and Confirmatory Factor Analysis
Advanced Statistics for Researchers, Session 3
Dr. Chris Rakes
Website: http://csrakes.yolasite.com | Email: Rakes@umbc.edu | Twitter: @RakesChris

Types of Variables
- Nominal: names, categories, ID numbers
- Ordinal: ranks
- Interval: dichotomous, polytomous (no absolute zero)
- Ratio: measurements, scalars (absolute zero)
Describing Data by the Center
For an example data set of n = 7 values:
- Mean (center value): x̄ = Σx / n ≈ 16.3
- Median (center term): the middle value once the data are ordered
- Mode: the most often repeated term(s)

Degrees of Freedom
The number of independent observations. Consider a group of 4 observations whose mean is 20 (sum = 80): x̄ = 20, so we estimate μ = 20. In the next sample, we've already estimated the population mean to be 20, so the 4 data points must sum to 80. The first 3 observations are free to be anything, but the fourth must be fixed to make the sum 80:
Free + Free + Free + Fixed = 80
So we always lose a degree of freedom when we estimate a parameter.
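The three measures of center and the degrees-of-freedom idea can be sketched with the standard library. The data values here are hypothetical illustrations, not the slide's original numbers:

```python
# Illustrative sketch (hypothetical data): computing the three measures
# of center, then showing why estimating the mean "uses up" one value.
import statistics

data = [50, 10, 12, 7, 12, 5, 20]  # hypothetical sample, n = 7

mean = sum(data) / len(data)       # center value
median = statistics.median(data)   # center term of the ordered data
mode = statistics.mode(data)       # most often repeated term

# Degrees of freedom: once the mean (here 20) of 4 observations is set,
# only 3 observations are free to vary; the 4th is determined.
group = [25, 15, 10]               # three "free" values
fixed = 4 * 20 - sum(group)        # the fourth must make the sum 80
```

With the mean fixed, `fixed` is forced to 30: three free values plus one fixed value, hence df = n − 1.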
Variance and Standard Deviation
Consider a sample data set: 59, 63, 71, 67, 64, 72, 66 (mean x̄ = 66).
Deviations X − x̄: −7, −3, 5, 1, −2, 6, 0
The deviations sum to 0. If the sum of the differences is always 0, how can we compute a meaningful average distance?
Notice that for a right triangle with legs a and b and hypotenuse c, the squares on the sides satisfy a² + b² = c².
Enter Pythagoras
a² + b² = c²
So square areas can be used to calculate distance, and squaring eliminates the zero-sum problem.

Returning to the sample data:
X:            59, 63, 71, 67, 64, 72, 66
Distance:     −7, −3,  5,  1, −2,  6,  0
(Distance)²:  49,  9, 25,  1,  4, 36,  0
Σ(X − x̄)² = 124
Let's look at a picture of these squares.
[Figure: the squared deviations drawn as literal squares over a score axis running from 55 to 75.]
How can we find the side length of the average square? Divide the total area by the degrees of freedom and take the square root. Why divide by 6 and not 7? Because one degree of freedom was spent estimating the mean.
Variance = 124 / 6 ≈ 20.7; side length ≈ 4.5
What does that get us? The average squared distance from the mean is the variance. For a population, σ² = Σ(X − X̄)² / N; for a sample, we divide by the degrees of freedom:
s² = Σ(X − x̄)² / (n − 1)
This "noise" gives us a measure of how much of the data is not represented by the mean.
Standard deviation: the side length of the variance square, i.e., the average distance from the mean:
s = √[ Σ(X − x̄)² / (n − 1) ]

Two Variables: Lines of Best Fit
Linear regression (least squares regression): how far do points regress from a line? (Regress = deviate from.) For two variables, we begin with the dependent variable: the average distance of each Y from Ȳ. With N = 8 paired observations (X, Y), list each deviation Y − Ȳ; as before, the deviations sum to 0, so we again convert to squares.
Converting to Squares
For Y (N = 8, df = 7): square each deviation Y − Ȳ and sum them:
SS_Y = Σ(Y − Ȳ)² = 48
So the average-sized square of variance is SS_Y / df = 48 / 7 ≈ 6.86, and the average distance from the mean is s_Y = √(48/7) ≈ 2.62.
Do the same for X (N = 8, df = 7):
SS_X = Σ(X − X̄)² = 56, so s²_X = 56 / 7 = 8 and s_X = √8 ≈ 2.83
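The sum-of-squares → variance → standard deviation chain above can be sketched in a few lines. The scores here are an illustrative sample chosen for this sketch:

```python
# Sketch of the deviation-square approach to spread.
# The scores are illustrative, not necessarily the slide's exact data.
scores = [59, 63, 71, 67, 64, 72, 66]

mean = sum(scores) / len(scores)           # 66.0
deviations = [x - mean for x in scores]    # these always sum to 0
ss = sum(d ** 2 for d in deviations)       # total area of the squares
df = len(scores) - 1                       # one df spent estimating the mean
variance = ss / df                         # average-sized square
sd = variance ** 0.5                       # side length = average distance
```

The squaring step is exactly what rescues the computation: `sum(deviations)` is 0, but `ss` is a meaningful total distance.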
Correlation and the Line of Best Fit
For each case, compute the cross-product (X − X̄)(Y − Ȳ); their sum is:
SP_XY = Σ(X − X̄)(Y − Ȳ) = 44
How strong is the relationship between X and Y?
r_XY = SP_XY / √(SS_X · SS_Y) = 44 / √(56 · 48) = 44 / 51.846 ≈ 0.849

Estimate the line of best fit, Ŷᵢ = b₀ + b₁Xᵢ:
b₁ = SP_XY / SS_X = 44 / 56 ≈ 0.786
b₀ = Ȳ − b₁X̄ = 2.88
So Ŷᵢ = 2.88 + 0.786Xᵢ
Use the Linear Equation Information
For each case, compute the prediction Ŷᵢ = 2.88 + 0.786Xᵢ, the residual Y − Ŷ, and its square. Summing the squared residuals gives:
Σ(Y − Ŷ)² = 13.43
How much of an effect did the regression have?
SS_regression = SS_Y − SS_residual = 48 − 13.43 = 34.57
r²_XY = SS_regression / SS_Y = 34.57 / 48 ≈ 0.720
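The full least-squares chain above can be sketched end to end. The X and Y values below are hypothetical; the formulas are the slide's (r = SP_XY / √(SS_X·SS_Y), b₁ = SP_XY / SS_X, b₀ = Ȳ − b₁X̄):

```python
# Least-squares regression from sums of squares and cross-products.
# Data are illustrative, not the worked example's original values.
import math

xs = [9, 13, 15, 8, 12, 14, 9, 16]    # hypothetical X
ys = [13, 16, 13, 12, 9, 14, 14, 16]  # hypothetical Y

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

ss_x = sum((x - xbar) ** 2 for x in xs)
ss_y = sum((y - ybar) ** 2 for y in ys)
sp_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

r = sp_xy / math.sqrt(ss_x * ss_y)    # correlation
b1 = sp_xy / ss_x                     # slope of the best-fit line
b0 = ybar - b1 * xbar                 # intercept

ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
ss_regression = ss_y - ss_residual    # variability the line accounts for
r_squared = ss_regression / ss_y      # equals r ** 2
```

Note the closing identity: the proportion of SS_Y explained by the regression equals the squared correlation, which is exactly the check the slide makes (0.849² ≈ 0.720).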
SEM
*Causal* processes can be represented by structural equations (regression equations: dependent variables predicted by independent variables, plus error). A model of these structural relations can be generated (and represented pictorially).

SEM Variables
- Observed (manifest, measured) variables: the Xs and Ys.
- Latent variables (factors): constructs that cannot be directly observed (or measured). Latent variables are estimated through hypothesized relationships with observed variables.
- Exogenous latent variables: independent variables that cause changes in other latent variables in the model. These are taken as given by the model under consideration, and any changes in exogenous variables are due to factors outside the model.
- Endogenous latent variables: dependent variables that are influenced by exogenous variables in the model. These are the outcomes the SEM model wishes to explain.
[Path diagram: observed Xs load on an exogenous latent variable via factor loadings; the exogenous latent predicts an endogenous latent variable, which carries a residual; the endogenous latent loads onto observed Ys, each with its own error term.]
Factor Analysis
Used to identify the factor structure, or model, for a set of variables (Stevens). Two types: exploratory (EFA) and confirmatory (CFA).

Exploratory Factor Analysis: Several Methods
- Principal Components Analysis (PCA): each successive component accounts for the largest amount of unexplained variance.
- Principal Axis Factoring: identical to PCA, except that the factors are extracted from a correlation matrix with communality estimates on the main diagonal rather than 1s, as in PCA.
- Unweighted Least Squares: minimizes the sum of squared differences between the observed and model-implied off-diagonal correlation matrices.
- Generalized Least Squares: correlations are weighted by the inverse of their uniqueness; high uniqueness means less weight.
- Alpha: maximizes the Cronbach alpha of the factors (i.e., reliability).
- Image: factors are defined by their linear regression on variables not associated with the hypothetical factors.
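The PCA idea, that each successive component accounts for the largest remaining share of variance, can be sketched with a plain eigendecomposition of the correlation matrix. This is a numpy-only illustration on simulated data (the session's actual analyses would use dedicated software):

```python
# Minimal PCA-style extraction: eigenvalues of the correlation matrix,
# sorted so the first component explains the most variance.
# Data are simulated: 200 hypothetical cases on 4 observed variables
# that share one common factor.
import numpy as np

rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))                 # one latent influence
data = factor @ np.ones((1, 4)) + rng.normal(size=(200, 4))

corr = np.corrcoef(data, rowvar=False)             # 4 x 4 correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr)            # eigh returns ascending order
eigvals = eigvals[::-1]                            # first component first

explained = eigvals / eigvals.sum()                # variance share per component
```

Because the diagonal of a correlation matrix is all 1s, the eigenvalues sum to the number of variables; contrast this with principal axis factoring, which replaces that diagonal with communality estimates before extraction.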
Maximum Likelihood Estimation
Attempts to find the population parameter values from which the observed data are most likely to have arisen. The likelihood function quantifies the discrepancy between the observed and model-implied parameters, assuming a normal distribution. Closed-form solutions for the parameters usually do not exist, so iterative algorithms are used in practice for parameter estimation.

The Model-Fitting Process
Let S = the sample variance/covariance matrix of observed scores on p variables.
Let Σ = the variance/covariance matrix of the population.
Let θ represent the vector of model parameters; therefore, Σ(θ) represents the restricted variance/covariance matrix implied by the model.
We are testing the hypothesis that the restricted matrix holds in the population. Null hypothesis: Σ = Σ(θ).
SEM computes a minimum discrepancy function, F_min.
Understanding the F_min Function
(Trace: the sum of the diagonal of a matrix.)
F_min = log|Σ(θ)| − log|S| + trace(S Σ(θ)⁻¹) − p
An inverse matrix times itself = the identity matrix (I). So as Σ(θ) approaches S, Σ(θ)⁻¹S approaches I; as a result, the trace of that matrix approaches the number of observed variables, p, and the difference trace(S Σ(θ)⁻¹) − p approaches 0. Likewise, as Σ(θ) approaches S, the difference log|Σ(θ)| − log|S| approaches 0.

Maximum Likelihood Estimation (Cont'd.)
The shape of the multivariate normal curve is defined by:
Lᵢ = (2π)^(−p/2) |Σ|^(−1/2) exp[ −½ (xᵢ − μ)ᵀ Σ⁻¹ (xᵢ − μ) ]
Substituting an individual's vector of scores xᵢ yields the likelihood of that set of scores given the population mean vector μ and covariance matrix Σ.
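The behavior of the discrepancy function, F = log|Σ(θ)| − log|S| + trace(S Σ(θ)⁻¹) − p, can be checked numerically: it is 0 when the model-implied matrix equals S and positive otherwise. The 2×2 covariance values below are illustrative:

```python
# Sketch of the ML minimum discrepancy function on a toy 2-variable problem.
# The "observed" covariance matrix S is illustrative.
import numpy as np

S = np.array([[4.0, 1.2],
              [1.2, 2.5]])

def f_ml(sigma_theta, S):
    """F = log|Sigma(theta)| - log|S| + trace(S @ inv(Sigma(theta))) - p."""
    p = S.shape[0]
    return (np.log(np.linalg.det(sigma_theta))
            - np.log(np.linalg.det(S))
            + np.trace(S @ np.linalg.inv(sigma_theta))
            - p)

perfect = f_ml(S, S)                     # Sigma(theta) = S  ->  F = 0
misfit = f_ml(np.diag(np.diag(S)), S)    # force the covariance to 0 -> F > 0
```

The `misfit` case mimics an overly restricted model (one that fixes the covariance between the two variables to zero) and shows the function registering the resulting discrepancy.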
Maximum Likelihood Estimation (Cont'd.)
A model's final parameter estimates are those that yield model-implied variances and covariances (and means) that maximize the combined likelihood of all n cases:
L = L₁ · L₂ · … · Lₙ

Casewise Log Likelihoods
Likelihoods tend to be very small numbers, and hence their products become practically infinitesimal. Taking the natural log of the likelihood makes things a bit more manageable:
log L = log L₁ + log L₂ + … + log Lₙ
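The "practically infinitesimal" point is easy to demonstrate: multiplying many per-case likelihoods underflows ordinary floating point to 0.0, while summing their logs stays finite. The univariate-normal likelihoods here are a toy illustration:

```python
# Why casewise *log* likelihoods: products of many small likelihoods
# underflow to 0.0; sums of logs do not. (Toy univariate-normal example.)
import math

def normal_likelihood(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))

xs = [0.5] * 1000                     # 1000 hypothetical cases

product = 1.0
for x in xs:
    product *= normal_likelihood(x)   # shrinks toward 0 and underflows

log_likelihood = sum(math.log(normal_likelihood(x)) for x in xs)
```

After 1000 cases the raw product is exactly 0.0 in double precision, but the summed log likelihood remains a perfectly usable (negative) number, which is why estimation software maximizes the latter.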
Casewise Log Likelihoods (Cont'd.)
With complete data, each case's contribution to the overall log likelihood (LL) is:
log Lᵢ = Kᵢ − ½ log|Σ| − ½ (xᵢ − μ)ᵀ Σ⁻¹ (xᵢ − μ)
In the missing-data context, each case's contribution to the log likelihood is:
log Lᵢ = Kᵢ − ½ log|Σᵢ| − ½ (xᵢ − μᵢ)ᵀ Σᵢ⁻¹ (xᵢ − μᵢ)
The data and parameter arrays can vary for each ith case: the ith case's contribution to the overall likelihood is based only on those variables for which that case has complete data.
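The case-specific subsetting can be sketched directly: for each case, keep only the observed variables and the matching slices of μ and Σ. The mean vector, covariance matrix, and data values below are illustrative, not from any real model:

```python
# Sketch of a casewise log likelihood that uses only the variables a case
# actually has. Parameters and data are illustrative.
import numpy as np

mu = np.array([0.0, 1.0, 2.0])
sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])

def casewise_loglik(x, mu, sigma):
    """Log likelihood of one case; np.nan marks a missing variable."""
    obs = ~np.isnan(x)                   # which variables this case has
    xo, mo = x[obs], mu[obs]             # case-specific data/parameter arrays
    so = sigma[np.ix_(obs, obs)]         # case-specific covariance block
    k = obs.sum()
    dev = xo - mo
    return (-0.5 * k * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(so))
            - 0.5 * dev @ np.linalg.inv(so) @ dev)

complete = casewise_loglik(np.array([0.1, 1.2, 1.9]), mu, sigma)
partial = casewise_loglik(np.array([0.1, np.nan, 1.9]), mu, sigma)
```

Both cases contribute a finite term; the second contributes through the 2-variable block of Σ only, with nothing imputed, which is the heart of the FIML approach described next.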
/4/04 Confirmatory Factor Analysis Cannot be run easily in basic statistics packages such as SPSS they do not offer the option to force variables to load on particular factors, only the number of factors. SEM software easily accommodates CFA models, e.g., MPlus, AMOS, EQS, LISREL. 3 Psychological Distress CFA First-Order CFA Second-Order CFA 3 6
Psychological Distress CFA Results

Model | Description | N | AIC | DF | Chi-Square | CFI | RMSEA | RMSEA LO90 | RMSEA HI90 | SRMR | ECVI
0a | Caregiver Psychological Distress | 7 | 898. | 03 | 83.8 | 0.90 | 0.088 | 0.076 | 0.00 | 0.049 | 36.56
0a | 0a with Q030 and Q03 covaried | 7 | 836.7 | 0 | 9.97 | 0.935 | 0.07 | 0.058 | 0.084 | 0.044 | 36.8
0a3 | 2nd-order CFA built on 0a | 7 | 838.7 | 0 | 9.97 | 0.935 | 0.07 | 0.059 | 0.085 | 0.044 | 36.9

Fit Index | Criterion
Minimum-fit χ² | Nested model comparison
CFI (Comparative Fit Index) | > 0.95
AIC (Akaike Information Criterion) | Model comparison only (does not have to be nested); smaller value = better fit
SRMR (Standardized Root Mean Square Residual) | < 0.10 = reasonable fit; < 0.08 = good fit
RMSEA (Root Mean Square Error of Approximation) | < 0.05 = good fit; 0.05–0.08 = reasonable; 0.08–0.10 = mediocre; > 0.10 = poor fit
ECVI | As the model is changed, a smaller value indicates greater likelihood of being generalizable in the population

Reflection on CFA
What is your dissertation/thesis conceptual framework? Are the constructs in your framework well-defined, and are the definitions well-established? Could a CFA strengthen your study? Why or why not?
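The conventional RMSEA bands (< .05 good, .05–.08 reasonable, .08–.10 mediocre, > .10 poor) can be wrapped in a small helper for screening a table of results. The function name and structure are ours, not from any SEM package:

```python
# Hypothetical helper applying conventional RMSEA cutoff bands.
def rmsea_label(rmsea):
    """Classify a point estimate of RMSEA into the usual verbal bands."""
    if rmsea < 0.05:
        return "good"
    if rmsea <= 0.08:
        return "reasonable"
    if rmsea <= 0.10:
        return "mediocre"
    return "poor"
```

In practice the 90% confidence interval around RMSEA (the LO90/HI90 columns above) should be reported alongside any such label, since a point estimate alone can be misleading.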
Thank You!
All materials from this workshop series can be downloaded at http://csrakes.yolasite.com/resource.php