Analysing and Understanding Learning Assessment for Evidence-based Policy Making
Correlation and Regression
Bangkok, 14-18 Sept. 2015
Australian Council for Educational Research

Correlation
The strength of a mutual relation between two (or more) things. You need to know two things about each unit of analysis:
- student (e.g. maths and reading performance)
- school (e.g. funding level and mean reading performance)
- country (e.g. mean performance in 2010 and in 2013)
No assumption is made about the direction of the relationship. Correlation is simply standardised covariance, i.e. covariance divided by the product of the standard deviations of the variables.

Formulas
Variance: σ² = Σ(X − X̄)² / (N − 1)
Standard deviation: σ = √[ Σ(X − X̄)² / (N − 1) ]
Covariance: cov(x, y) = Σ(X − X̄)(Y − Ȳ) / (N − 1)
Correlation (Pearson's r): r = cov(x, y) / (σₓ σᵧ)

A note on sample vs population estimators
Sample variance: σ̂² = Σ(X − X̄)² / N
Sample covariance: cov(x, y) = Σ(X − X̄)(Y − Ȳ) / N
An estimate of the variance based on a sample is biased: it underestimates the true variance. It needs a correction factor of N / (N − 1) to produce an unbiased estimate.
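To make the correction factor concrete, here is a minimal Python sketch (numpy assumed available): ddof selects the divisor, so ddof=0 gives the biased /N estimator and ddof=1 the corrected /(N − 1) estimator.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
n = len(x)

# Biased estimator: divides by N and underestimates the population variance
var_biased = np.var(x, ddof=0)      # sum((x - x.mean())**2) / N

# Corrected estimator: divides by N - 1
var_unbiased = np.var(x, ddof=1)    # sum((x - x.mean())**2) / (N - 1)

# The correction factor N / (N - 1) links the two estimates
print(np.isclose(var_biased * n / (n - 1), var_unbiased))  # True
```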

Type of correlation
The correlation coefficient to use depends on the level of measurement of the variables:
- Ordinal (ranks, Likert scales, ordered categories): Spearman correlation (ρ), Kendall's tau (τ)
- Interval/ratio (metric scales, measures of magnitude): Pearson correlation (r)
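As an illustration, all three coefficients can be computed with scipy; the data below are simulated purely for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

r, p_r = stats.pearsonr(x, y)          # interval/ratio data
rho, p_rho = stats.spearmanr(x, y)     # ordinal data (rank-based)
tau, p_tau = stats.kendalltau(x, y)    # ordinal data (rank-based)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```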

Things to remember
- Independence: are the two values independent of each other?
- Linearity: is the relationship between the two values linear?
- Normality: are the two values distributed normally? (If not, a non-parametric correlation should be used.)

Correlation values
0 = no relationship
1.0 = perfect positive relationship
-1.0 = perfect negative relationship
0.1 = weak relationship (if significant)
0.3 = moderate relationship (if significant)
0.5 = strong relationship (if significant)

Strong correlation: r = .80

Perfect correlations: r = 1 and r = -1

Moderate correlation: r = .36

No correlation: r = .06

Correlation vs Regression
Correlation is not directional: the degree of association goes both ways (e.g. height and weight). Correlation is not appropriate when the substantive meaning of X being associated with Y differs from that of Y being associated with X, nor when one of the variables is being manipulated or being used to explain the other. Use regression instead.

Practical exercises
Be careful about spurious correlations: just because two variables correlate highly does not mean there is a valid relationship between them. Correlation is not causation. With a large enough dataset, anything can be significantly correlated with something.

Regression
Also describes a relationship between two (or more) things, but assumes a direction: explain one variable with one (or more) other variable(s). For example: how well does SES predict performance?

Regression cont.
Two main statistics:
- Size of the effect, or slope
- Strength of the effect, or explained variance

The General Idea
Simple regression considers the relation between a single explanatory variable and a response variable.
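A minimal sketch of such a fit, using statsmodels on simulated data (the SES and performance variables here are invented for illustration, not workshop data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
ses = rng.normal(size=200)                                     # explanatory variable
performance = 450 + 30 * ses + rng.normal(scale=40, size=200)  # response variable

X = sm.add_constant(ses)              # adds the column of 1s for the intercept
fit = sm.OLS(performance, X).fit()    # ordinary least squares
print(fit.params)                     # [intercept, slope]
print(fit.rsquared)                   # explained variance (R^2)
```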

Line of best fit (OLS)

Size of the effect
[Plot: a 1-unit change in x corresponds to a 50-point change in y; slope = 50]

Size of the effect cont.
[Plot: a 1-unit change in x corresponds to a 25-point change in y; slope = 25]

The R²
The proportion of the total sample variance that is not explained by the regression is:
    Residual sum of squares / Total sum of squares
Therefore, the proportion of the variance in the dependent variable that is explained by the independent variable (R²) is:
    R² = 1 − (Residual sum of squares / Total sum of squares)

Strength of the effect
For example, if the residual variance is a small proportion of the total variance:
    R² = 1 − (162.5 / 1250) = 0.87
87% of the variation in reading is explained by ESCS.

Strength cont.
For example, if the residual variance is a large proportion of the total variance:
    R² = 1 − (1075 / 1250) = 0.14
Only 14% of the variation in reading is explained by ESCS.
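The same arithmetic, spelled out in a few lines of Python using the slides' numbers:

```python
# Total sum of squares is the same in both scenarios
tss = 1250.0

for rss in (162.5, 1075.0):          # residual sum of squares per scenario
    r_squared = 1 - rss / tss
    print(f"RSS = {rss}: R^2 = {r_squared:.2f}")
# RSS = 162.5: R^2 = 0.87
# RSS = 1075.0: R^2 = 0.14
```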

Multiple Regression
Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y. The intent is to look at the independent effect of each variable while adjusting out the influence of potential confounders. Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Regression Modeling
A simple regression model (one independent variable) fits a regression line in 2-dimensional space; the residuals are the vertical distances between the observed points and the line. A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space. This concept can be extended indefinitely, but visualisation is no longer possible with more than three variables. Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Multiple Regression Model
Again, estimates for the multiple slope coefficients are derived by minimising the sum of squared residuals, yielding the multiple regression model ŷ = a + b₁x₁ + b₂x₂ + … + b_kx_k. Again, the standard error of the regression is based on the squared residuals across all n observations. Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Multiple Regression Model
The intercept (α) predicts where the regression plane crosses the Y axis. The slope for variable X₁ (β₁) predicts the change in Y per unit X₁, holding X₂ constant. The slope for variable X₂ (β₂) predicts the change in Y per unit X₂, holding X₁ constant. Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.
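A small simulated sketch of this "holding the other constant" interpretation (the coefficient values below are invented):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=300)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
# params are [alpha, beta1, beta2]; each slope is the predicted change in y
# per unit of its predictor, holding the other predictor constant
print(fit.params)
```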

Main purposes of regression analysis
Prediction: developing a prediction model based on a set of predictor/independent variables. This purpose also allows for the evaluation of the predictive power of different models, as well as of different sets of predictors within a model.
Explanation: validating or confirming an existing prediction model using new data. This purpose also allows for the assessment of the relationship between predictor and outcome variables.

Regression works provided assumptions are met
- Linearity: check using partial regression plots (PLOTS → Produce all partial plots).
- Uniform variance (homoscedasticity): check by plotting residuals against the predicted values (PLOTS → Y: ZRESID, X: ZPRED). For ANOVA, check using Levene's test for homogeneity of variance (EXPLORE → PLOTS → Spread vs Level).
- Independence of error terms: check by plotting residuals against a sequencing variable.
- Normality of the residuals: check using a normal P-P plot of the residuals (PLOTS → Normal probability plot).
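For readers not using SPSS, a rough Python equivalent of two of these checks might look like the sketch below, on simulated data; note it substitutes a normal Q-Q plot for the P-P plot (both assess normality of the residuals).

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1 + 2 * x + rng.normal(size=200)
fit = sm.OLS(y, sm.add_constant(x)).fit()

# Homoscedasticity: residuals vs predicted values should show no fan shape
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()

# Normality of the residuals: points should track the reference line
sm.qqplot(fit.resid, line="s")
plt.show()
```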

Sample size
Thorough method: a priori power analysis. Compute sample sizes for given effect sizes, alpha levels, and power values (G*Power 3: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/).
Fast method (but less thorough): rules of thumb. For significance testing of R²: N ≥ 50 + 8k (for k predictors). For significance testing of b-values: N ≥ 104 + k. For both, use the larger number.
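The rules of thumb are easy to encode; this tiny helper (an illustrative sketch, not part of G*Power) returns the larger of the two minimum sample sizes for k predictors.

```python
def min_sample_size(k: int) -> int:
    """Rule-of-thumb minimum N for k predictors: 50 + 8k for testing R^2,
    104 + k for testing individual b-values; use the larger of the two."""
    return max(50 + 8 * k, 104 + k)

print(min_sample_size(5))   # max(90, 109) -> 109
print(min_sample_size(10))  # max(130, 114) -> 130
```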

Multicollinearity
y = b₀ + b₁x₁
y = b₀ + b₁x₁ + b₂x₂, but if x₂ = x₁ + 3, then
y = b₀ + b₁x₁ + b₂(x₁ + 3)
y = b₀ + b₁x₁ + b₂x₁ + 3b₂ = (b₀ + 3b₂) + (b₁ + b₂)x₁
so b₁ and b₂ cannot be estimated separately.
Checking for multicollinearity: for overall multicollinearity, VIF > 10 or tolerance < 0.10; for individual variables, identify a condition index > 15, then check for variance proportions of coefficients > .90.
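VIFs can be computed directly; here is a sketch with statsmodels on deliberately collinear simulated data (variable names invented).

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)    # x2 is nearly a copy of x1
X = sm.add_constant(np.column_stack([x1, x2]))

# VIF for each predictor (column 0 is the constant); VIF > 10 is a red flag
for i in (1, 2):
    print(f"VIF of x{i}: {variance_inflation_factor(X, i):.1f}")
```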

Influential values
Influential values are outliers that have a substantial effect on the regression line. Source: Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage.

When does linear regression modelling become inappropriate?
- When the dependent variable is dichotomous or polytomous (use logistic regression).
- When data are sequential over time and variables are autocorrelated (use time series analysis).
- When context effects need to be analysed and slopes differ across higher-level units (use multi-level analysis).

Application: Illustrative Example
Childhood respiratory health survey. The binary explanatory variable (SMOKE) is coded 0 for non-smokers and 1 for smokers. The response variable, forced expiratory volume (FEV), is measured in litres/second (lung capacity). Regressing FEV on SMOKE gives the least squares regression line ŷ = 2.566 + 0.711x. The mean FEV in non-smokers is 2.566; the mean FEV in smokers is 3.277. Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Example, cont.
ŷ = 2.566 + 0.711x
The intercept (2.566) is the mean FEV of group 0. The slope is the mean difference in FEV between the groups (because x is coded 0/1): 3.277 − 2.566 = 0.711. t-stat = 6.464 with 652 df, p < .01 (b₁ is significant). The 95% CI for the slope is 0.495 to 0.927. Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.
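To see why the slope equals the group mean difference for a 0/1 predictor, here is a simulation loosely calibrated to the slide's coefficients (the data are invented, so the estimates only approximate 2.566 and 0.711):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
smoke = rng.integers(0, 2, size=654)                          # 0/1 predictor
fev = 2.566 + 0.711 * smoke + rng.normal(scale=0.8, size=654)

fit = sm.OLS(fev, sm.add_constant(smoke)).fit()
b0, b1 = fit.params
# With a 0/1 predictor, the slope equals the difference in group means
print(b1, fev[smoke == 1].mean() - fev[smoke == 0].mean())
print(fit.conf_int()[1])                                      # 95% CI for the slope
```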

Smoking increases lung capacity?
Children who smoked had higher mean FEV. How can this be true, given what we know about the deleterious respiratory effects of smoking? Answer: the smokers were older than the non-smokers, and AGE confounded the relationship between SMOKE and FEV. A multiple regression model can be used to adjust for AGE in this situation. Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Extending the analysis: Multiple regression
From the SPSS output for our example (intercept a, slopes b₁ and b₂), the multiple regression model is: FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE). Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Multiple Regression Coefficients, cont.
The slope coefficient for SMOKE is −0.209, suggesting that smokers have 0.209 less FEV on average than non-smokers (after adjusting for age). The slope coefficient for AGE is 0.231, suggesting that each year of age is associated with an increase of 0.231 FEV units on average (after adjusting for SMOKE). Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Inference About the Coefficients
Inferential statistics are calculated for each regression coefficient. For example, in testing H₀: β₁ = 0 (the SMOKE coefficient, controlling for AGE): t-stat = −2.588 and P = 0.010, with df = n − k − 1 = 654 − 2 − 1 = 651.

Coefficients (dependent variable: fev)
              Unstandardized B   Std. Error   Standardized Beta    t        Sig.
(Constant)         .367             .081                           4.511    .000
smoke             -.209             .081           -.072          -2.588    .010
age                .231             .008            .786          28.176    .000

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Inference About the Coefficients
The 95% confidence interval for the slope of SMOKE, controlling for AGE, is −0.368 to −0.050.

95% Confidence Interval for B (dependent variable: fev)
              Lower Bound   Upper Bound
(Constant)       .207          .527
smoke           -.368         -.050
age              .215          .247

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Assessing the significance of the model
R Square (R²) represents the proportion of variance in the outcome variable that is accounted for by the predictors in the model. For example, if for our previous model R² = .23, then 23% of the variance in FEV is accounted for by smoking status and age.
Adjusted R² compensates for the inflation of R² due to overfitting, and is useful for comparing the amount of variance explained across several models.
The standard error of the estimate is a measure of the accuracy of the predictions. For example, if the SE of the estimate is 0.35 for our previous model, FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE), then the predicted FEV for a non-smoker aged 12 years is FEV = 3.139 ± (t × 0.35).

Assessing the significance of the model: Hierarchical models
Suppose:
Model 1: FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE), R² = .23
Model 2: FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE) + 0.04(GENDER), R² = .29
What is the amount of unique variance explained by gender above and beyond that explained by smoking status and age? It is the change in R² between the nested models: ΔR² = .29 − .23 = .06.
[Venn diagrams: the variance in FEV shared with SMOKE and AGE, with and without GENDER]
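The R² change for nested models can be computed and tested directly; this sketch uses simulated stand-in data that merely mirrors the slides' variable names.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 654
df = pd.DataFrame({"smoke": rng.integers(0, 2, n),
                   "age": rng.uniform(5, 18, n),
                   "gender": rng.integers(0, 2, n)})
df["fev"] = (0.37 - 0.21 * df.smoke + 0.23 * df.age
             + 0.04 * df.gender + rng.normal(scale=0.5, size=n))

m1 = smf.ols("fev ~ smoke + age", data=df).fit()
m2 = smf.ols("fev ~ smoke + age + gender", data=df).fit()

print(m2.rsquared - m1.rsquared)    # unique variance added by gender
print(sm.stats.anova_lm(m1, m2))    # F-test of the R^2 change
```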

Hierarchical regression in SPSS

Dummy Variables: More than two levels
For categorical variables with k categories, use k − 1 dummy variables. Example: SMOKE2 has three levels, initially coded 0 = non-smoker, 1 = former smoker, 2 = current smoker. Use k − 1 = 3 − 1 = 2 dummy variables to code this information, like this:

                 dummy 1   dummy 2
non-smoker          0         0
former smoker       1         0
current smoker      0         1

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.
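In Python, pandas can generate this coding; a short sketch with invented category labels:

```python
import pandas as pd

smoke2 = pd.Series(["non-smoker", "former", "current", "former", "non-smoker"])

# k = 3 categories -> k - 1 = 2 dummies, with non-smoker as the reference
dummies = pd.get_dummies(smoke2, prefix="smoke")[["smoke_former", "smoke_current"]]
print(dummies)
```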

Use of standardised coefficients
Standardised coefficients are often thought to be easier to interpret, but standardisation depends on the variances of the independent variables. Unstandardised coefficients can be interpreted directly in the original units of the variables; however, unstandardised coefficients cannot always be compared when different units are used for different variables.

Finding the best regression model
The set of predictors must be chosen based on theory. Avoid the "whatever sticks to the wall" approach. The grouping of predictors and the ordering of entry will matter. Selecting the best final model can sometimes be a judgment call.

How to judge whether a model is good?
- Explained variance, as measured by R².
- Size of the regression coefficients.
- Significance tests (F-test for the model, t-tests for the parameters).
- Inclusion of all relevant variables (theory!).
- Is the method appropriate?

The six steps to interpreting results
1. Look at the prediction equation to see an estimate of the relationship.
2. Refer to the standard error of the estimate (in the appropriate model) when making predictions for individuals.
3. Refer to the standard errors of the coefficients (in the most complete model) to see how much you can trust the estimates of the effects of the explanatory variables.
4. Look at the significance levels of the t-ratios to see how strong the evidence is in support of including each of the explanatory variables in the model.
5. Use the coefficient of determination (R²) to measure the potential explanatory power of the model.
6. Compare the beta-weights of the explanatory variables in order to rank them by explanatory importance.

Notes on interpreting the results
- Prediction is NOT causation. Inferring causation requires at least temporal precedence, but temporal precedence alone is still not sufficient.
- Avoid extrapolating the prediction equation beyond the range of the data.
- Always consider the standard errors and the confidence intervals of the parameter estimates.
- The magnitude of the coefficient of determination (R²) needed for explanatory power is a judgment call.

Practice exercises!
Study: House, J. D. (2006). Mathematics beliefs and achievement of elementary school students in Japan and the United States: Results from the Third International Mathematics and Science Study (TIMSS).
- Interpret the parameter estimates.
- Interpret the statistical significance of the predictors.
- Make substantive interpretations about the findings.

Extensions: Regression
Multiple regression considers the relation between a set of explanatory variables and a response or outcome variable:
    Independent predictor (x₁) → Outcome (y) ← Independent predictor (x₂)

Moderating effect: Moderated regression
When the independent variable does not affect the outcome directly but rather affects the relationship between the predictor and the outcome:
    Independent predictor (x₁) → Outcome (y), with the independent variable (x₂) acting on the x₁ → y path

Moderating effect: Simple moderating effect
When a categorical independent variable affects the relationship between the predictor and the outcome.
[Plot: Y against X, with separate regression lines for moderator categories C1, C2, and C3]

Moderating effects
[Plots: a categorical moderator and a continuous moderator; y = actual scaled score on the Multidimensional Perfectionism Scale (Hewitt & Flett)]

Types of moderators (Sharma et al., 1981)

                                   Related to predictor      Not related to predictor
                                   and/or outcome            and/or outcome
No interaction with predictor      Independent predictor     Homologizer
Interaction with predictor         Quasi-moderator           Pure moderator

Homologizer variables affect the strength (rather than the form) of the relationship between predictor and outcome (Zedeck, 1971).

Testing Moderation
Moderation effects are also known as interaction effects. Interaction terms are product terms of the moderator and the relevant predictor (the variable the moderator interacts with). Starting from:
    Y = b₀ + b₁x₁ + b₂x₂ + b₃m
the interaction term is i₁ = x₁ × m. Choosing the moderator and the relevant predictor must have theoretical support; for example, it is possible that the moderator interacts with x₂ instead (i.e., i₁ = x₂ × m). Testing for the interaction effect necessitates including the interaction term(s) in the regression equation:
    Y = b₀ + b₁x₁ + b₂x₂ + b₃m + b₄i₁
and testing H₀: b₄ = 0.
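In a formula interface such as statsmodels, the product term can be added directly; a sketch on simulated data (all coefficients invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "m": rng.normal(size=n)})
df["y"] = (1 + 0.5 * df.x1 + 0.3 * df.x2 + 0.2 * df.m
           + 0.4 * df.x1 * df.m + rng.normal(size=n))

# "x1:m" adds the product term i1 = x1 * m; the t-test of its coefficient
# (b4) is the moderation test, H0: b4 = 0
fit = smf.ols("y ~ x1 + x2 + m + x1:m", data=df).fit()
print(fit.params["x1:m"], fit.pvalues["x1:m"])
```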

Mediating effect: Mediated regression
When the independent predictor does not affect the outcome directly but affects it through an intermediary variable (the mediator):
    Independent predictor (x₁) → Intermediary predictor (x₂) → Outcome (y)

Mediation vs Moderation
Mediators explain why or how an independent variable X causes the outcome Y, while a moderator variable affects the magnitude and direction of the relationship between X and Y (Saunders, 1956). The two approaches can be combined for more complex analyses: moderated mediation and mediated moderation.

Checklists
Moderation:
- Collinearity between predictor and moderator (especially true for quasi-moderators).
- Unequal variances between groups based on the moderator.
- Reliability of measures (measurement errors are magnified when creating the product terms).
Mediation:
- Theoretical assumptions about the mediator.
- Rationale for selecting the mediator.
- Significance and type (full/partial) of the mediation effect.
- Implied causation (i.e., directional paths).