Lecture 9: SEM, Statistical Modeling, AI, and Data Mining


Alison Casey
5 years ago

Lecture 9: SEM, Statistical Modeling, AI, and Data Mining

I. Terminology of SEM

Related Concepts:
- Causal Modeling
- Path Analysis
- Structural Equation Modeling
- Latent variables (factors): not directly measurable, but measured through items
- LISREL (Linear Structural Relations)
- Covariance Structure Analysis***

***Key clue to what is involved and how it is done!

Common Elements:
- Involves a system of variables.
- Variables are ordered in a (theoretically derived?) sequence (structure) or model, implying assumed causal relationships (structural relationships or paths).
- The system of relationships between variables is specified by a series of equations (structural or prediction), as in multiple regression analysis.
- Structural equations define the model to be tested for fit.
  - Fit is generally assessed against data.
  - Fit may be assessed against alternative models.
  - Fit of data to the theory! Thus, SEM is for theory testing, not for exploratory study!
- Modeling involves testing a hypothesized model.

Two levels of significance testing:
- Testing individual structural parameters for significance: is a specific path (regression coefficient) significant? Like multiple regression (H0: β = 0).
- Looking at the fit of the overall model to the data (goodness of fit).
  - A variety of models can be posited; how well do they fit?
  - The posited relationships are used to estimate the expected correlation matrix, which is compared via a chi-square test with the observed correlation matrix.
  - Fit is a property of the system of variables (i.e., the model).
  - The model implies tests of the presence and/or absence of paths.

Comparison with Multiple Regression

Multiple Regression
[Diagram: predictors X1 to X5 and outcome Y]
Q: How well do the predictors predict (explain variance in) Y? What are the independent effects when the effects of other variables are controlled?

Causal Modeling
[Diagram: path model linking X1 to X5 and Y]
Q: How well do the predictors relate with regard to the ultimate prediction of Y?

PATH ANALYSIS: Theory specifies the existence of paths between variables.

Two kinds of effects
- Direct effects: X3 involves direct effects of X1 and X2.

[Path diagram: X1 → X3; X2 → X3, X4, X5; X3 → X4; X4 → Y; X5 → Y]
X3 = X1 + X2
X4 = X2 + X3
X5 = X2
Y = X4 + X5

The Logic of Model Testing in SEM

Begin with an observed correlation matrix for X1, X2, and X3:

        X1      X2      X3
X1              r12     r13
X2                      r23
X3

Hypothesize a structural model to test:
[Path diagram: X1 → X2, X1 → X3, X2 → X3]

This model can be represented by the following equations:

$\hat{r}_{12} = p_{21}$
$\hat{r}_{13} = p_{31} + p_{32} p_{21}$ (direct effect of X1 on X3 plus indirect effect via X2)
$\hat{r}_{23} = p_{32} + p_{31} p_{21}$ (direct effect of X2 on X3 plus the component due to the effects of X1 on both X3 and X2)

The $\hat{r}_{ij}$ represent reconstructed or estimated correlations based upon the theoretical model.

Path coefficients can be estimated using multiple regression methods (standardized partial coefficients) based upon a given model, and can be used to reconstruct the correlation matrix.

The estimated correlations can be compared with the observed correlations, and chi-square will show whether the model fits (a non-significant chi-square denotes good fit). The χ2 test involves observed vs. expected ("reconstructed") correlations.
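The reconstruction logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three correlation values are hypothetical, the function names are invented for this sketch, and the path coefficients are obtained from the standard two-predictor formulas for standardized partial regression coefficients.

```python
def path_coefficients(r12, r13, r23):
    """Standardized path coefficients for the model X1 -> X2, X1 -> X3, X2 -> X3."""
    p21 = r12  # with a single predictor, the standardized slope equals the correlation
    denom = 1.0 - r12 ** 2
    p31 = (r13 - r12 * r23) / denom  # partial coefficient of X1 when X3 is regressed on X1, X2
    p32 = (r23 - r12 * r13) / denom  # partial coefficient of X2 when X3 is regressed on X1, X2
    return p21, p31, p32


def reconstruct(p21, p31, p32):
    """Reconstructed correlations implied by the path coefficients."""
    r12_hat = p21
    r13_hat = p31 + p32 * p21
    r23_hat = p32 + p31 * p21
    return r12_hat, r13_hat, r23_hat


# Hypothetical observed correlations (r12, r13, r23); not taken from the lecture.
observed = (0.5, 0.4, 0.6)

# Full model: three paths for three correlations, so the fit is exact.
full = reconstruct(*path_coefficients(*observed))

# Restricted model: drop X1 -> X3 (p31 = 0); X3 is then regressed on X2 alone.
restricted = reconstruct(p21=observed[0], p31=0.0, p32=observed[2])

print("observed:  ", observed)
print("full:      ", full)
print("restricted:", restricted)
```

With all three paths the model reproduces the observed correlations exactly, so there is nothing left to test. Dropping the X1 → X3 path leaves a gap between the observed and reconstructed r13, and that gap is what the chi-square test evaluates.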

Two kinds of effects
- Direct effects
  - X3 involves direct effects of X1 and X2.
  - X5 involves the direct effect of X2 only.
  - Y involves direct effects of X4 and X5.
- Indirect effects: the effect of a variable through another variable
  - X1 is hypothesized to influence X4 via an indirect effect through X3.
  - X1 is posited to influence Y through indirect effects through X3 and X4.

[Path diagram: X1 → X3; X2 → X3, X4, X5; X3 → X4; X4 → Y; X5 → Y]
X3 = X1 + X2
X4 = X2 + X3
X5 = X2
Y = X4 + X5

Model Identification: Review
Need to scale the latent variables in order to identify the model:
1) Set the regression coefficient for one indicator equal to 1. All other indicators are interpreted relative to this value.
OR
2) Set the variance of the latent variable to 1 (standardizing).
Most common method for CFA.

- Just-identified: relationships = hypotheses
- Over-identified: relationships < hypotheses
- Under-identified: relationships > hypotheses
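As a rough illustration of the direct and indirect effects listed in this slide, the sketch below encodes the path model as a dictionary and applies the usual path-tracing rule: an indirect effect along a route is the product of the coefficients on that route, and routes are summed. The coefficient values and function names are hypothetical and chosen only for the example.

```python
# Hypothetical standardized path coefficients, keyed by (cause, effect).
# The structure follows the slide: X3 = X1 + X2, X4 = X2 + X3, X5 = X2, Y = X4 + X5.
paths = {
    ("X1", "X3"): 0.4, ("X2", "X3"): 0.3,
    ("X2", "X4"): 0.2, ("X3", "X4"): 0.5,
    ("X2", "X5"): 0.6,
    ("X4", "Y"): 0.7, ("X5", "Y"): 0.1,
}


def total_effect(source, target, paths):
    """Sum over every directed route from source to target of the product of its coefficients."""
    if source == target:
        return 1.0
    return sum(
        coef * total_effect(child, target, paths)
        for (parent, child), coef in paths.items()
        if parent == source
    )


def indirect_effect(source, target, paths):
    """Total effect minus the direct path, if a direct path exists."""
    return total_effect(source, target, paths) - paths.get((source, target), 0.0)


print("X1 on X4 (via X3):      ", indirect_effect("X1", "X4", paths))
print("X1 on Y (via X3 and X4):", indirect_effect("X1", "Y", paths))
print("X2 on Y (all routes):   ", total_effect("X2", "Y", paths))
```

The recursion terminates because the path model is acyclic; a model with feedback loops would need a different treatment.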

χ2 Goodness of Fit Test

Observed correlations in the data:

        X1      X2          X3
X1              r12(o)      r13(o)
X2                          r23(o)
X3

Reconstructed correlations based upon the path model:

        X1      X2          X3
X1              r̂12(e)      r̂13(e)
X2                          r̂23(e)
X3

The more similar the observed and expected correlations, the smaller the chi-square:

$\chi^2 = \sum \left( r_{ij(o)} - \hat{r}_{ij(e)} \right)^2 / \hat{r}_{ij(e)}$

Model Comparison via Incremental Fit Test
- Each model has a χ2 value based upon a certain number of degrees of freedom.
- If the models are nested (i.e., identical except that M2 deletes one parameter found in M1), the significance of the increment or decrement in fit is given by the chi-square difference test: $\Delta\chi^2 = \chi^2_2 - \chi^2_1$ with $df = df_2 - df_1$, since the more restricted model M2 has the larger χ2 and more degrees of freedom.
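A minimal sketch of the chi-square difference test for two nested models follows. The χ2 and df values are hypothetical, and the helper names are invented here. The p-value comes from a small standard-library implementation of the chi-square upper tail for integer degrees of freedom; in practice a statistics library such as SciPy (`scipy.stats.chi2.sf`) would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df >= x), for x >= 0."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    # Recurrence on the regularized upper incomplete gamma function:
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_difference_test(chi2_restricted, df_restricted, chi2_full, df_full):
    """Return (delta chi-square, delta df, p-value) for two nested models."""
    delta_chi2 = chi2_restricted - chi2_full
    delta_df = df_restricted - df_full
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fit results: M1 is the fuller model, M2 deletes one path from M1.
delta, ddf, p = chi2_difference_test(chi2_restricted=18.9, df_restricted=5,
                                     chi2_full=12.3, df_full=4)
print(f"delta chi2 = {delta:.2f}, df = {ddf}, p = {p:.4f}")
```

A small p-value here would suggest that deleting the path significantly worsens the fit, so the path should probably be retained.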

Theory Trimming: Adding/Deleting Paths
- Step Down: Run the full model and identify nonsignificant paths; re-run the model dropping paths one by one; assess changes in fit. This is an empirical approach; the model should be theory based.
- Step Up: Run the reduced (hypothesized) model; re-run with the missing paths to assess whether adding those paths improves model fit.
- Look at:
  - Significance of paths
  - Overall improvement in model fit
- Should be theory driven.

LISREL and SEM: Combine Structural and Measurement Models using Confirmatory Factor Analysis (CFA)
- Structural Models: the causal relations (βs) linking independent and dependent variables
  [Diagram: SES → DEPRESS, with path coefficient β]
- Measurement Models: the factor structure involved in latent constructs
  [Diagram: indicators ED, INC, and OCC load on the latent construct SES (loadings l1, l3)]

Fit Indices
- Overall χ2 is a poor measure of fit: with large n, χ2 will generally be significant.
- Alternative fit indices were developed to avoid the problems with χ2.
- They involve improvement in fit compared to some reference model.
- Normed Fit Index (NFI; Bentler and Bonett)
  - Measures improvement in fit over the independence model (no relationships between model variables).
  - Ranges from 0 (no improvement) to 1 (100% improvement).
- Others: Non-Normed Fit Index (NNFI), Goodness of Fit Index (GFI)

Model Modification Indices
- LISREL/AMOS provide model fit indices.
- They also provide modification indices, which:
  - Suggest paths or error covariances that may improve model fit.
  - Show the amount of change in χ2 that would result.
  - Are atheoretical and sample-specific, but may make sense.
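For concreteness, the NFI described above can be computed from two χ2 values: that of the independence (null) model and that of the hypothesized model. The sketch below uses the Bentler and Bonett definition; the function name and the input values are hypothetical.

```python
def normed_fit_index(chi2_model, chi2_independence):
    """NFI = (chi2_independence - chi2_model) / chi2_independence."""
    return (chi2_independence - chi2_model) / chi2_independence


# Hypothetical values: the independence model fits badly, the hypothesized model much better.
nfi = normed_fit_index(chi2_model=50.0, chi2_independence=500.0)
print(f"NFI = {nfi:.2f}")
```

A value near 1 means the hypothesized model removes most of the misfit of the independence model; a value near 0 means it improves little over assuming no relationships at all.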

- Interactions: product of the two factors
- Multiple Group Comparisons: multi-group SEM to test the invariance of models across different groups; cross-cultural management
- Time series data: trend analysis
- Panel data: analysis of longitudinal data
- Regression models: fixed effects and random effects models
- Autoregression and ARIMA models
- Rolling regressions

Theoretical Modeling
- Linear algebra
- Dynamic programming
- Game theoretical models
- Econometric models
- Operations research
- MCMC
- Monte Carlo simulations with artificial data
- Model testing and validation with empirical data

(Sample) Statistics models
1. Sample and independence assumptions
2. Simple linear regression analysis
3. Multiple linear regression analysis (with introduction to GLM)
4. Errors-in-variables models
5. Structural equation modeling (with introduction to bootstrap resampling and time series analysis)
6. Logistic regression and residual logistic regression models
7. Survival analysis

Bayesian statistics

Three approaches to probability
- Axiomatic: probability by definition and properties
- Relative frequency: repeated trials
- Degree of belief (subjective): personal measure of uncertainty

Problems
- The chance that a meteor strikes earth is 1%.
- The probability of rain today is 30%.
- The chance of getting an A on the exam is 50%.

Problems of statistical inference
- H0: θ = 1 versus Ha: θ > 1
- Classical approach
  - P-value = P(Data | θ = 1)
  - The P-value is NOT P(Null hypothesis is true).
  - Confidence interval [a, b]: what does it mean?
- But the scientist wants to know: P(θ = 1 | Data), i.e., P(H0 is true) = ?
- Problem: θ is not random.

Bayesian statistics
- Fundamental change in philosophy: θ is assumed to be a random variable.
- This allows us to assign a probability distribution for θ based on prior information, i.e., Bayesian updating.
- A 95% interval [1.34 < θ < 2.97] means what we want it to mean: P(1.34 < θ < 2.97) = 95%.
- P-values mean what we want them to mean: P(Null hypothesis is false).
- Not subject, or less subject, to the assumption of normal distributions.

Bayes Theorem

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

Bayes Theorem for Statistics
- Let θ represent the parameter(s).
- Let X represent the data.

$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$

- The left-hand side is a function of θ.
- The denominator on the right-hand side does not depend on θ.
- Posterior distribution ∝ Likelihood × Prior distribution
- Posterior distribution = Constant × Likelihood × Prior distribution
- The equation can be understood at the level of densities.
- Goal: explore the posterior distribution of θ.

Graphic Models and Bayesian Networks
- Tree models: CART, regression trees, random forests
- Network models
  - Bayesian networks
  - Neural networks
- No a priori model specification
- More flexible
- Not subject to the independence assumptions
- Require statistical learning, artificial intelligence methods, or machine learning
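The relation "posterior = constant × likelihood × prior" can be illustrated with a simple grid approximation. The example below is a sketch under assumptions that are not in the lecture: θ is a success probability, the data are 7 successes in 10 binomial trials, and the prior is flat. The function name is hypothetical.

```python
def grid_posterior(successes, trials, grid_size=101):
    """Grid approximation of the posterior of a binomial proportion under a flat prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat prior: an assumption made for this sketch
    likelihood = [t ** successes * (1.0 - t) ** (trials - successes) for t in thetas]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    constant = 1.0 / sum(unnormalized)  # the constant that does not depend on theta
    posterior = [constant * u for u in unnormalized]
    return thetas, posterior


# Hypothetical data: 7 successes in 10 trials.
thetas, posterior = grid_posterior(successes=7, trials=10)

mode = thetas[posterior.index(max(posterior))]
prob_above_half = sum(p for t, p in zip(thetas, posterior) if t > 0.5)
print(f"posterior mode = {mode:.2f}, P(theta > 0.5 | data) = {prob_above_half:.3f}")
```

Because θ is treated as random, a statement such as P(θ > 0.5 | Data) is a direct probability computed from the posterior, which is the quantity the classical approach cannot provide.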

Artificial Intelligence and Machine Learning
- In contrast to statistical models, which are closed-form
- Can learn statistical models (from the data)
- But more suitable for open-form graphical models
- Genetic algorithm
- Genetic programming
- Evolutionary programming
- Simulated annealing

Bayesian Networks (BN)
- Bayesian probability theory (D'Ambrosio 1999; Geiger and Heckerman 1996; Haddawy 1999; Heckerman and Wellman 1995; Larrañaga et al. 2000)
- Joint distribution of probabilities based on local distributions
- DAG, posterior model, and causal structures (example)
- Data handling capability
- Free from assumptions and constraints; field independent
- MDL metric, minimum description length (Lam 1998; Lam and Bacchus 1994), to emulate any fitness function
- Minimize error and improve accuracy

Research Problems
- A priori research: prior selection bias, tinted glasses, too slow
- Genetic Algorithm: 1) computationally intractable; 2) lacks the ability to represent knowledge structure or to build a model. Difficult to interpret or to compare.

Solution:
- Bayesian Networks for model building
- Evolutionary Programming for learning, search, and optimization

An Example of BN

Bayesian Network Learning
- Goal: to automatically construct the network structure and the associated conditional probability parameters from a database.
- Two phases:
  1) Learn a network structure that best characterizes the input data.
  2) Compute the probabilities associated with the nodes using standard statistical methods.
- The second phase is more straightforward. Most of the difficulties are in the first phase.

Evolutionary Programming (EP)
- Evolutionary theory: survival of the fittest
- Evolutionary computation: genetic algorithms (GA), genetic programming (GP), evolutionary programming (EP), and evolution strategy
- Individuals (Bayesian Network models) and the population
- Parents and offspring
- Mutation operators
- Optimization by MDL in tournament: N = 50
- Generations: 5000
- The selected model is the one with the highest tournament score (the lowest MDL score).

Minimum Description Length (MDL)
- There is a trade-off between accuracy and usefulness. A more complex network is usually more accurate, but computationally and conceptually more difficult to use.
- The MDL principle states that the best model of a collection of data is the one that minimizes the sum of the encoding lengths of the data and the model itself.
- The MDL metric measures the total description length DL of a network structure G. A better network has a smaller value on this metric.
- Other criteria: BIC, AIC, etc.

Mutation Operators
- Simple mutation: randomly adds an edge between two nodes or randomly deletes an existing edge from the parent.
- Reversion mutation: randomly selects an existing edge and reverses its direction.
- Move mutation: randomly selects an existing edge, then moves the parent of the edge to another node or moves the child of the edge to another node.
- Knowledge-guided mutation based on model MDL scores
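The simple and reversion mutation operators described above might look roughly like the following sketch. It is not the lecture's implementation: the edge-set representation, the function names, the 50/50 choice between adding and deleting, and the rule of discarding any offspring that contains a cycle are all assumptions made for illustration. The move and knowledge-guided operators are omitted.

```python
import random


def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    return visited == len(nodes)


def simple_mutation(nodes, edges, rng):
    """Randomly delete an existing edge, or add an edge between two random nodes."""
    edges = set(edges)
    if edges and rng.random() < 0.5:
        edges.remove(rng.choice(sorted(edges)))
    else:
        parent, child = rng.sample(nodes, 2)
        edges.add((parent, child))
    return edges


def reversion_mutation(nodes, edges, rng):
    """Randomly select an existing edge and reverse its direction."""
    edges = set(edges)
    if edges:
        parent, child = rng.choice(sorted(edges))
        edges.remove((parent, child))
        edges.add((child, parent))
    return edges


def mutate(nodes, edges, rng):
    """Apply one random operator; keep the parent network if the offspring has a cycle."""
    operator = rng.choice([simple_mutation, reversion_mutation])
    offspring = operator(nodes, edges, rng)
    return offspring if is_acyclic(nodes, offspring) else set(edges)


# Hypothetical four-node network used only to exercise the operators.
nodes = ["A", "B", "C", "D"]
network = {("A", "B"), ("B", "C")}
rng = random.Random(0)
for _ in range(200):
    network = mutate(nodes, network, rng)
print(sorted(network))
```

In a full evolutionary programming loop, each offspring would then be scored (for example, with the MDL metric) and compete with its parent in the tournament.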

The Experiments
- Credit card database (308,857 people, ITA, 2,000 variables, 1,623 responders, R = 0.53%, N = 3,785)
- Catalog marketing database (106,284 consumers, 12 catalogs' promotions, 12 years, 1995 Census data and TRW credit information, over 300 variables, and R = 11.3%)
- Variable selection (logistic regression forward selection)
- Sampling: training sets and testing sets
- 10-fold cross-validation with non-overlapping samples

The Results
- The Algorithm of Evolutionary Programming
- The Algorithm for Evolutionary Programming to Learn Bayesian Networks
- Bayesian models and Bayesian probability scores
- Credit card promotion: decile analysis and gains chart
- Examples of a learned Bayesian Network model
- Catalog marketing: decile analysis and gains chart
- Parameter estimates: examples
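The decile analysis and gains chart mentioned in the results can be sketched as follows. This is a generic illustration rather than the procedure used in the experiments: the function name is hypothetical, the data are synthetic, and the input is assumed to contain at least one responder.

```python
def decile_gains(scores, responses):
    """Cumulative percentage of all responders captured in the top 1..10 deciles by score."""
    ranked = [resp for _, resp in sorted(zip(scores, responses),
                                         key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    total_responders = sum(ranked)  # assumed to be greater than zero
    gains = []
    for decile in range(1, 11):
        cutoff = round(n * decile / 10)
        gains.append(100.0 * sum(ranked[:cutoff]) / total_responders)
    return gains


# Synthetic example: 100 customers whose model score perfectly ranks the 10 responders first.
scores = list(range(100))
responses = [1 if s >= 90 else 0 for s in scores]
gains = decile_gains(scores, responses)
print(gains)
```

A model with no predictive power would capture roughly 10% of responders per decile, so the gap between the gains curve and that diagonal is what the gains chart displays.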


Methods and Data from Multiple Sources
- Multi-method, multi-data research
- Primary data: survey, experiment
- Secondary data: census, World Bank, IMF, government agencies, crawled data from the web
- Qualitative + quantitative analyses: QCA, sentiment analysis, text/audio/graphic/video
- Empirical data analysis + experiments
