Chapter VII Measures of Correlation

A researcher may be interested in finding out whether two variables are significantly related. For instance, he may be interested in knowing whether mental ability is significantly related to school performance; whether work performance is significantly related to the level of morale of the employees; or whether study habits are significantly related to grades in mathematics. In all these problems, the researcher can choose a research design that will help him establish such relationships. The simplest thing he can do is to study the patterns of values of the two variables. This is the area of correlation analysis.

Basic Concepts in Correlation Analysis

Correlation analysis is concerned with the analysis of the linear relationship between two or more variables. It is used to determine whether the variables are significantly related.

The relationship between two variables can be positive or negative. A positive linear relationship exists if an increase in the value of one variable corresponds to an increase in the value of the other. Equivalently, a positive correlation exists if a decrease in the value of one variable corresponds to a decrease in the value of the other. On the other hand, a negative linear relationship exists if a decrease in the value of one variable corresponds to an increase in the value of the other, or, equivalently, if an increase in one corresponds to a decrease in the other.

Example: It has been shown that IQ and academic performance are positively related. The following pairs of variables are negatively correlated:

1. academic achievement and hours per week of watching TV
2. time spent in practice and number of typing errors
3. absenteeism rate and job satisfaction

To determine the strength of the association or correlation between variables, we compute a statistic known as the correlation coefficient. It measures the degree or strength of the linear relationship between two or more variables. There are several coefficients of correlation, and their use depends on the type of data (e.g., Pearson r, Spearman rho, Cramer's V, point-biserial). Most coefficients of correlation take values between -1 and +1. The qualitative description of the strength of correlation is based on the following suggested guide.

Figure 1. Guide in interpreting correlation coefficients.

The Scatter Plot

A scatter plot, also known as a scatter diagram, is a graphical representation of the relationship between two variables. The following scatter plots show the various types and degrees of linear association between two variables.

Figure 2a. Strong positive correlation
Figure 2b. Low negative correlation
Figure 2c. Strong negative correlation
Figure 2d. No correlation
Figure 2e. Low positive correlation

Spurious Correlation and Causality

Causality, also known as causation, is defined as a cause-effect relationship between two variables. A significant correlation does not necessarily indicate causality; it may instead reflect a common linkage in a sequence of events. One such situation arises when both variables are influenced by a common cause (a confounding variable) and are therefore correlated with each other (spurious correlation). For example, individuals with a higher level of income have both higher levels of savings and higher levels of spending. We might find a positive correlation between level of savings and level of spending, but this does not mean that one variable causes the other.

Moore (1993) found a significant strong correlation between the amount of ice cream sold and the number of deaths by drowning. This does not mean that eating ice cream causes death by drowning, or the other way around. The significant correlation is actually due to the season (summer time). If a researcher wants to show cause and effect, he should conduct a controlled experiment.
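The savings-and-spending example can be illustrated with a small simulation. The sketch below is only an illustration under assumed numbers: the income distribution, the coefficients 0.2 and 0.7, and the noise levels are all hypothetical and not taken from any data set. Income drives both variables, and neither variable enters the other's equation.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation from raw sums (computational formula)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical model: income is the common cause (confounding variable).
income = [random.gauss(50, 10) for _ in range(2000)]
noise_savings = [random.gauss(0, 1) for _ in income]
noise_spending = [random.gauss(0, 3) for _ in income]

# Savings and spending each depend on income plus their own noise;
# neither one depends on the other.
savings = [0.2 * inc + e for inc, e in zip(income, noise_savings)]
spending = [0.7 * inc + e for inc, e in zip(income, noise_spending)]

r_spurious = pearson_r(savings, spending)
r_noise = pearson_r(noise_savings, noise_spending)

print(f"r(savings, spending)       = {r_spurious:.3f}")
print(f"r(parts not due to income) = {r_noise:.3f}")
```

Under these assumptions the first correlation comes out strongly positive, while the correlation between the income-free parts stays close to zero, which is the pattern the text describes as spurious correlation.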

Correlation Between Two Interval Variables (Pearson r)

The Pearson product-moment correlation coefficient, popularly known as Pearson's r, is the most widely used correlation coefficient. Values of r for pairs of variables are commonly reported in research reports and journals as a means of summarizing the extent of the relationship between two variables measured in at least the interval scale. This coefficient is appropriate if observations are sampled from a bivariate normal distribution. Given sample data, Pearson r is computed using the following formula:

$$ r = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right]\left[n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}} $$

Testing for the Significance of Pearson's r

When computing a correlation coefficient, it is also useful to test the coefficient for significance. This gives the researcher some idea of how large a correlation coefficient must be before it can be taken to demonstrate that there really is a relationship between two variables. Two variables may appear related by chance, and a hypothesis test for r allows the researcher to decide whether the observed r could have emerged by chance.

To test the correlation coefficient for statistical significance, it is necessary to define the true correlation coefficient that would be observed if all population values were available. This true correlation coefficient is usually denoted by the Greek letter ρ (rho). The null hypothesis is that there is no relationship between the two variables X and Y. In symbols,

H0: ρ = 0

The null hypothesis may be tested against any one of the following alternative hypotheses:

a) Ha: ρ ≠ 0
b) Ha: ρ > 0
c) Ha: ρ < 0

The test statistic is based on the sample (observed) correlation coefficient r. As various samples are drawn, each of size n, the value of r varies from sample to sample. Under the null hypothesis, the sampling distribution of the statistic below is approximated by a t distribution with n - 2 degrees of freedom. Hence, the test statistic for testing the significance of r is given by

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

Example: The following table presents data on years of formal education (X) and age of entry into the labour force (Y) of a random sample of 12 workers.

     X      Y      X²      Y²      XY
    10     16     100     256     160
    12     17     144     289     204
    15     18     225     324     270
     8     15      64     225     120
    20     18     400     324     360
    17     22     289     484     374
    12     19     144     361     228
    15     22     225     484     330
    12     18     144     324     216
    10     15     100     225     150
     8     18      64     324     144
    10     16     100     256     160
   ----   ----   -----   -----   -----
   149    214    1999    3876    2716

That is, ΣX = 149, ΣY = 214, ΣX² = 1999, ΣY² = 3876, and ΣXY = 2716. Substituting into the formula with n = 12,

$$ r = \frac{12(2716) - (149)(214)}{\sqrt{\left[12(1999) - 149^2\right]\left[12(3876) - 214^2\right]}} = \frac{706}{\sqrt{(1787)(716)}} = 0.6241 $$

Therefore, there is a moderate positive correlation between years of formal education (X) and age of entry into the labour force (Y).

Computing and testing the significance of Pearson r using Stata

1. Type the data in the Stata Editor (or in Excel) the way it appears in the table above (only the columns of X and Y).
2. From the menu at the top of the screen, click on Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Pairwise correlations.
3. In the Main tab of the dialog box, select X and Y in the pull-down menu.
4. Check the boxes opposite "Print significance level for each entry" and "Significance level for displaying with a star".
5. Click OK.
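As a cross-check outside Stata, the hand computation above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the variable names are arbitrary, and the p-value is left to a t table or to statistical software.

```python
import math

# Years of formal education (X) and age of entry into the labour force (Y)
# for the 12 workers in the example.
x = [10, 12, 15, 8, 20, 17, 12, 15, 12, 10, 8, 10]
y = [16, 17, 18, 15, 18, 22, 19, 22, 18, 15, 18, 16]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Computational formula for Pearson r
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Test statistic with n - 2 degrees of freedom
t = r * math.sqrt((n - 2) / (1 - r ** 2))

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(f"r = {r:.4f}, t = {t:.3f}, df = {n - 2}")
```

The sums should agree with the column totals in the table, and r should agree with the hand-computed 0.6241. The resulting t is then compared with the t distribution on 10 degrees of freedom.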

OUTPUT

    . pwcorr x y, sig star(5)

                  x          y
        x    1.0000
        y    0.6241*    1.0000
             0.0301

In this output, the first line is the command, the starred entry 0.6241 is Pearson's r, and the value beneath it, 0.0301, is the p-value.

Interpretation: There is a significant moderate positive correlation between years of formal education and age of entry into the labour force (r = 0.6241, p = 0.0301).

Correlation Between Two Ordinal Variables (Spearman rho)

If both variables are measured in the ordinal scale, then the most appropriate correlation coefficient is the Spearman rank correlation coefficient, popularly known as Spearman rho. The value of the Spearman rho coefficient is calculated from the ranks of the observations. The computational formula of Spearman's rho, together with the test statistic for testing its significance, is given by

$$ r_s = 1 - \frac{6\sum d^2}{n^3 - n} $$

$$ t = r_s\sqrt{\frac{n-2}{1-r_s^2}} $$

where
d = the paired differences between the ranks
n = the number of paired ranks

Example: Suppose that a researcher wants to find out the relationship between the job performance of 6 employees (measured on a scale of 1 to 10) and their job satisfaction (also measured on a scale of 1 to 10). The hypothetical data are given below, with ranks in parentheses.

    Performance    Satisfaction    Difference (d)    d²
    10 (1)          8 (3)          2                 4
     8 (3)          9 (2)          1                 1
     9 (2)         10 (1)          1                 1
     4 (6)          5 (5)          1                 1
     5 (5)          3 (6)          1                 1
     6 (4)          6 (4)          0                 0
                                              Σd² =  8

The differences are shown as magnitudes, since only d² enters the formula. Thus,

$$ r_s = 1 - \frac{6(8)}{6^3 - 6} = 1 - \frac{48}{210} = 0.7714 $$

and there exists a strong linear relationship between performance and satisfaction.

Computing and testing the significance of Spearman rho using Stata

1. Type the data in the Stata Editor (or in Excel) as two columns, performance and satisfaction.
2. From the menu at the top of the screen, click on Statistics > Nonparametric analysis > Tests of hypotheses > Spearman's rank correlation.
3. In the Main tab of the dialog box, select the two variables (performance and satisfaction) in the pull-down menu.
4. Check the boxes opposite "Print significance level for each entry" and "Significance level for displaying with a star".
5. Click OK.
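The ranking and the Spearman computation above can be sketched in Python as a cross-check. This is a minimal sketch, not a general implementation: the helper `ranks_desc` is a hypothetical name, it gives rank 1 to the largest value as in the table, and it assumes there are no ties, which holds for this data set.

```python
import math

performance = [10, 8, 9, 4, 5, 6]
satisfaction = [8, 9, 10, 5, 3, 6]


def ranks_desc(values):
    """Rank 1 = largest value. Assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


rank_perf = ranks_desc(performance)
rank_sat = ranks_desc(satisfaction)

# Sum of squared rank differences
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_perf, rank_sat))

n = len(performance)
r_s = 1 - 6 * sum_d2 / (n ** 3 - n)

# Test statistic with n - 2 degrees of freedom
t = r_s * math.sqrt((n - 2) / (1 - r_s ** 2))

print(rank_perf, rank_sat)
print(f"sum d^2 = {sum_d2}, r_s = {r_s:.4f}, t = {t:.3f}, df = {n - 2}")
```

The ranks, the sum of squared differences, and r_s should agree with the worked table. With tied values, average ranks would be needed, which this sketch does not handle.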


OUTPUT

    . spearman performance satisfaction, star(0.05)

     Number of obs =       6
    Spearman's rho =  0.7714

    Test of Ho: performance and satisfaction are independent
        Prob > |t| =  0.0724

In this output, the first line is the command, 0.7714 is Spearman's rho, and 0.0724 is the p-value.

Interpretation: There is a strong linear relationship between performance and satisfaction (r_s = 0.7714). However, the correlation is not significant at the 0.05 level (p = 0.0724). The strong association observed between these two variables may be due only to chance, or the sample may simply be too small for the test to detect a real relationship.

Correlation Between an Interval Variable and a Two-category Nominal Variable

If one of the two variables in the analysis is dichotomous (two-category nominal) and the other is interval in scale, the point-biserial correlation coefficient is appropriate. The computational formula of the point-biserial coefficient is given by

$$ r_{pbi} = \frac{\bar{y}_1 - \bar{y}_0}{s_y}\sqrt{pq} $$

where:
$\bar{y}_1$ - mean of the Y scores for those individuals with X scores equal to 1
$\bar{y}_0$ - mean of the Y scores for those individuals with X scores equal to 0
$s_y$ - standard deviation of all Y scores
p - proportion of individuals with an X score equal to 1
q - proportion of individuals with an X score equal to 0

As with Pearson r and Spearman rho, the significance of the point-biserial coefficient is tested using a t test.

Example: Consider the dexterity scores of a random sample of male and female respondents. Let 1 denote male and 0 denote female.

    Subject    Sex    Dexterity
       1        1        11
       2        1        13
       3        1        12
       4        1        13
       5        1        10
       6        0        10
       7        0         7
       8        0         9
       9        0         6
      10        0         7

Solution: The mean dexterity score is 11.8 for males and 7.8 for females. The standard deviation of all dexterity scores is s_y = 2.5298; p = 0.5 and q = 0.5. Hence,

$$ r_{pbi} = \frac{\bar{y}_1 - \bar{y}_0}{s_y}\sqrt{pq} = \frac{11.8 - 7.8}{2.5298}\sqrt{(0.5)(0.5)} = 0.7906 $$

Therefore, there exists a high correlation between sex and dexterity scores. Males tend to have higher dexterity scores than females.

Computing and testing the significance of the point-biserial coefficient using Stata

1. Type the data in the Stata Editor (or in Excel) as it appears in the table above.
2. Install the point-biserial (pbis) package. In the Command window, type findit pbis and hit Enter. Click on the link for the package, follow the installation instructions, and wait until installation is complete.
3. In the Command window, type pbis sex dexterityscore. Hit Enter.

OUTPUT

    . pbis sex dexterityscore
    (obs= 10)
    Np= 5  p= 0.50
    Nq= 5  q= 0.50
    ------------------+------------------+------------------+------------------+
    Coef.= 0.7906       t= 3.6515          P>|t| = 0.0065     df= 8

In this output, the first line is the command, Coef. is the point-biserial coefficient, and P>|t| is the p-value.
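The coefficient and t statistic reported above can be cross-checked with a short Python sketch. It assumes that s_y is the sample standard deviation (n - 1 denominator), which is what reproduces the value 2.5298 used in the solution; the variable names are arbitrary.

```python
import math
import statistics

sex = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]          # 1 = male, 0 = female
dexterity = [11, 13, 12, 13, 10, 10, 7, 9, 6, 7]

# Split the Y scores by the value of the dichotomous variable
y1 = [score for s, score in zip(sex, dexterity) if s == 1]
y0 = [score for s, score in zip(sex, dexterity) if s == 0]

n = len(sex)
p = len(y1) / n
q = len(y0) / n

# Sample standard deviation of all Y scores (n - 1 denominator)
s_y = statistics.stdev(dexterity)

r_pbi = (statistics.mean(y1) - statistics.mean(y0)) / s_y * math.sqrt(p * q)

# Test statistic with n - 2 degrees of freedom
t = r_pbi * math.sqrt((n - 2) / (1 - r_pbi ** 2))

print(f"mean(1) = {statistics.mean(y1)}, mean(0) = {statistics.mean(y0)}")
print(f"s_y = {s_y:.4f}, r_pbi = {r_pbi:.4f}, t = {t:.4f}, df = {n - 2}")
```

The coefficient and t value should agree with the Coef. and t entries of the Stata output.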

Interpretation: There exists a significant strong correlation between sex and dexterity scores (r_pbi = 0.7906, p = 0.0065). Further, males have significantly higher dexterity scores than females.

Correlation Between Two Dichotomous Nominal Variables

The phi (φ) coefficient is useful when both variables are nominal in scale, or categorical with two categories (dichotomous). In this case, the data are arranged in a 2 x 2 contingency table with cell frequencies a and b in the first row and c and d in the second row. The computational formula of the phi coefficient, together with the test statistic for testing its significance, is given by

$$ \phi = \frac{bc - ad}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} $$

$$ t = \phi\sqrt{\frac{n-2}{1-\phi^2}} $$

Example: Consider the problem of determining the correlation between sex and political party affiliation. Suppose we have 10 subjects with the following data:

    Person    Sex (X)    Party (Y)
      A          1           1
      B          1           1
      C          1           0
      D          1           1
      E          1           1
      F          0           0
      G          0           1
      H          0           1
      I          0           0
      J          0           0

where:
sex - 1 if male; 0 if female
party - 1 if Republican; 0 if Democrat

The corresponding 2 x 2 contingency table is given below.

                     Sex (X)
    Party (Y)      0      1      Total
        1          2      4        6
        0          3      1        4
      Total        5      5       10

Then, with a = 2, b = 4, c = 3, and d = 1,

$$ \phi = \frac{(4)(3) - (2)(1)}{\sqrt{(6)(4)(5)(5)}} = \frac{10}{\sqrt{600}} = 0.408 $$

Therefore, there exists a moderate positive association between sex and political party affiliation.

Computing and testing the significance of the phi coefficient using Stata

1. Type the data in the Stata Editor (or in Excel), with one row per person and the columns sex and party.
2. Install the phi coefficient (phi) package. In the Command window, type findit phi. Hit Enter. Scroll down to this link: sp3 from http://www.stata.com/stb/stb3. Click on this link and wait until installation is complete.
3. In the Command window, type phi sex party. Hit Enter.

OUTPUT

    . phi sex party

                    Party (Y)
    Sex (X)       0      1      Total
       0          3      2        5
       1          1      4        5
     Total        4      6       10

    Pearson chi2(1) = 1.6667    Pr = 0.197
    phi = Cohen's w = fourfold point correlation = 0.4082
    phi-squared = 0.1667

In this output, the first line is the command, Pr is the p-value, and the phi line gives the phi coefficient.

Interpretation: There exists a non-significant moderate association between sex and political party affiliation (φ = 0.408, p = 0.197). There is insufficient evidence to conclude that either sex is predominantly affiliated with one political party.

REMARK: The value of the phi coefficient can also be computed from the value of the chi-square test statistic. It is calculated as

$$ \phi = \sqrt{\frac{\chi^2}{N}} $$

For the example above, this gives $\sqrt{1.6667/10} = 0.408$.
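The phi computation and the remark linking it to the chi-square statistic can be checked with a short Python sketch. It is only an illustration: the cell labels a, b, c, d follow the hand-computed table (rows are party 1 and 0, columns are sex 0 and 1), and the p-value line uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper tail equals erfc(sqrt(x/2)).

```python
import math

sex = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]      # 1 = male, 0 = female
party = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]    # 1 = Republican, 0 = Democrat

pairs = list(zip(party, sex))
a = pairs.count((1, 0))   # party 1, sex 0
b = pairs.count((1, 1))   # party 1, sex 1
c = pairs.count((0, 0))   # party 0, sex 0
d = pairs.count((0, 1))   # party 0, sex 1
n = a + b + c + d

phi = (b * c - a * d) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Relation from the remark: chi-square = N * phi^2
chi2 = n * phi ** 2

# Upper-tail probability of chi-square with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"a={a}, b={b}, c={c}, d={d}")
print(f"phi = {phi:.4f}, chi2 = {chi2:.4f}, p = {p_value:.3f}")
```

The cell counts should match the 2 x 2 table, and phi, the chi-square value, and the p-value should agree with the Stata output.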

Chi-square-based Measures of Association for Categorical Variables

The chi-square test for independence of two categorical variables is used to determine whether the variables are associated. However, it does not provide information on the strength of the association. The following are some measures of (nonlinear) association between variables based on the chi-square test. The significance of these measures follows that of the chi-square test statistic.

a) Cramer's V

Cramer's V is appropriate if the dimension of the contingency table is larger than 2 x 2 and the number of rows is not equal to the number of columns. It is considered an alternative to the phi coefficient. The value of V can be calculated from the chi-square statistic as follows:

$$ V = \sqrt{\frac{\chi^2}{N(L-1)}} $$

where
χ² = the value of the chi-square statistic
N = the grand total
L = the number of rows or columns, whichever is smaller

Example: Suppose that we are interested in the relationship between work performance problems of employees and their use of three treatment clinics. The data are given below.

                                         Type of Clinic
    Type of Problem                Cardiovascular   Mental Health   Alcoholism   Total
    Absenteeism                          27              13              8         48
    Tardiness                            15              14             11         40
    Poor Quality Work                    15              12             14         41
    Problems with Other Workers           3              16             24         43
    Total                                60              55             57        172

Analysis using Stata

1. Type the data in the Stata Editor (or in Excel), with one row per cell of the table: the variables problem and clinic, and the cell frequency num_workers.
2. Type the following command in the Command window: tabulate problem clinic [fweight=num_workers], chi2 V. Press Enter.

OUTPUT

    . tabulate problem clinic [fweight=num_workers], chi2 V

                                       clinic
    problem                  Cardiovas   Mental he   Alcoholis   Total
    Absenteeism                  27          13           8        48
    Tardiness                    15          14          11        40
    Poor quality of work         15          12          14        41
    Problem with co-worke         3          16          24        43
    Total                        60          55          57       172

    Pearson chi2(6) = 27.9281    Pr = 0.000
    Cramér's V = 0.2849

Interpretation: There exists a significant but low association between work performance problem and type of treatment clinic (V = 0.2849, p < 0.001). The majority of workers with absenteeism problems attend the Cardiovascular treatment clinic (27 of 48), while the majority of workers with problems with their co-workers attend the Alcoholism treatment clinic (24 of 43).

b) Contingency coefficient C

The contingency coefficient C is appropriate if the dimension of the contingency table is larger than 2 x 2 and the number of rows is equal to the number of columns. The value of C is calculated as:

$$ C = \sqrt{\frac{\chi^2}{\chi^2 + N}} $$

where
χ² = the value of the chi-square statistic
N = the grand total

Unfortunately, there is no available Stata package for computing the value of C. Its value has to be computed manually.

Example: A random sample of 200 married men, all retired, were classified according to education and number of children. Is there a significant association between the size of the family and the level of education attained by the father?

                         Number of Children
    Education        0-1      2-3      Over 3     Total
    Elementary        14       37        32         83
    Secondary         19       42        17         78
    College           12       17        10         39
    Total             45       96        59        200

Analysis using Stata

1. Type the data in the Stata Editor (or in Excel), with one row per cell of the table: the variables educ and numchild, and the cell frequency num_men.
2. Type the following command in the Command window: tabulate educ numchild [fweight=num_men], chi2. Press Enter.

OUTPUT

    . tabulate educ numchild [fweight=num_men], chi2

                          numchild
    educ             0-1      2-3      Over 3     Total
    Elementary        14       37        32         83
    High School       19       42        17         78
    College           12       17        10         39
    Total             45       96        59        200

    Pearson chi2(4) = 7.4644    Pr = 0.113

3. To generate the value of C, type the command: display sqrt(7.4644/(7.4644+200))

    C = 0.1896818

Interpretation: There exists a non-significant, very low association between the level of education of the father and the number of children in the family (C = 0.1897, p = 0.113).

Other measures of association

1. Gamma coefficient
2. Kendall's rank correlation
3. Eta coefficient
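The two chi-square-based measures worked in this section, Cramer's V and the contingency coefficient C, can be cross-checked with the Python sketch below. It is an illustrative sketch only: the helper names `chi_square` and `chi2_upper_tail_even_df` are hypothetical, and the p-value helper relies on the closed form of the chi-square upper tail that holds only when the degrees of freedom are even (6 and 4 here).

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and grand total for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x); closed form valid only for even df."""
    half = x / 2
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(df // 2)
    )


# Work problem (rows) by treatment clinic (columns)
clinic = [[27, 13, 8], [15, 14, 11], [15, 12, 14], [3, 16, 24]]
chi2_clinic, n_clinic = chi_square(clinic)
smaller_dim = min(len(clinic), len(clinic[0]))   # L in the formula for V
cramers_v = math.sqrt(chi2_clinic / (n_clinic * (smaller_dim - 1)))
p_clinic = chi2_upper_tail_even_df(chi2_clinic, df=6)

# Education (rows) by number of children (columns)
family = [[14, 37, 32], [19, 42, 17], [12, 17, 10]]
chi2_family, n_family = chi_square(family)
contingency_c = math.sqrt(chi2_family / (chi2_family + n_family))
p_family = chi2_upper_tail_even_df(chi2_family, df=4)

print(f"clinic: chi2 = {chi2_clinic:.4f}, V = {cramers_v:.4f}, p = {p_clinic:.5f}")
print(f"family: chi2 = {chi2_family:.4f}, C = {contingency_c:.4f}, p = {p_family:.3f}")
```

The chi-square values, V, C, and the p-values should agree with the two Stata outputs and with the display command used for C.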