The topics in this section concern the second course objective. Correlation is a linear relation between two random variables.


4.1 Correlation

The topics in this section concern the second course objective. Correlation is a linear relation between two random variables. Note that the term "relation" used in this section means connection or relationship; it is not the mathematical concept of relation as in "relations and functions." However, the term "function" used in this section and in this book is the mathematical concept of function.

The most common parameter used to indicate the correlation between two random variables is the Pearson correlation coefficient (or simply the correlation coefficient). The notation for this correlation coefficient is $\rho$ (rho). Often, the correlation coefficient of two variables X and Y is denoted as $\rho_{X,Y}$. With this coefficient, the correlation is expressed by a number between -1 and 1; that is, $-1 \le \rho \le 1$. A value of 0 (that is, $\rho_{X,Y} = 0$) means that there is no correlation between the two variables X and Y. If $\rho_{X,Y} = -1$, then the two variables have a perfect negative correlation, while if $\rho_{X,Y} = 1$, then the two variables have a perfect positive correlation.

For example, suppose a random system outputs a circle, and the size of a circle produced by this system is of interest. That is, the output of the system is a circle, and the property of interest is the size of the circle produced. The system cannot produce circles of a uniform (constant) size, so the circle size is random (and, thus, this system is a random system). By the way, the input of the system is a radius for the circle to be produced. As stated above, the system cannot produce same-sized circles for a constant radius input.

There are several ways of measuring the property of the output from the system. For instance, the size of a circle can be measured by its area (inches squared) or its diameter (inches). That is, there are two ways of measuring the property of interest, namely the size of a circle.

Let W be the random variable of the circle area and Y be the random variable of the circle diameter. Then there is a relation between the two random variables W and Y, given by

$$W = 0.25\,\pi\,Y^2.$$

That is, the variable W has a quadratic relation with the variable Y, since W is a quadratic function of Y. This quadratic relation is not a correlation because the relation is not linear. For two variables to be correlated with each other, the relation between them must be linear. By the way, note that these variables are random, and the randomness comes from the randomness of the output of the random system. However, the (quadratic) relation of W and Y is not random.

There is another measure for the size of a circle, the circumference (inches). Let X be the random variable of the circle circumference. Then there is a relation between W and X given as

$$W = \frac{0.25}{\pi}\,X^2.$$

Again, the variable W has a quadratic relation with the variable X, which is not a linear relation and hence not a correlation in the sense of this section.

On the other hand, the two random variables X and Y have a relation given as

$$X = \pi Y \quad\text{or}\quad Y = \frac{1}{\pi}\,X.$$

I hope you recognize, from the prerequisite, the last equation as the slope-intercept form of a linear equation with a positive slope of $1/\pi$ and the origin as the y-intercept. That is, Y is a linear function of X and, hence, the two random variables X and Y have a linear relation.
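The three circle measures can be tied together through the radius. The short derivation below is only a sketch; the symbol $a$ for the radius is introduced here for illustration and is not part of the notation used elsewhere in this section.

$$
\begin{aligned}
Y &= 2a, \qquad X = 2\pi a, \qquad W = \pi a^2,\\
W &= \pi\left(\tfrac{Y}{2}\right)^2 = 0.25\,\pi\,Y^2,\\
X &= \pi Y \quad\Longleftrightarrow\quad Y = \tfrac{1}{\pi}\,X,\\
W &= \pi\left(\tfrac{X}{2\pi}\right)^2 = \tfrac{0.25}{\pi}\,X^2.
\end{aligned}
$$

Of the three relations, only the one between X and Y has the form of a constant times the other variable, which is why it is the only linear one.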

These random variables X and Y have a correlation and, in fact, X and Y are perfectly positively correlated, which means $\rho_{X,Y} = 1$. The point of this Circle Example is that a correlation must be a linear relation of two random variables.

The slope of the linear equation is not the value of the correlation coefficient, since the slope of a line can be greater than 1 or less than -1. However, the sign of the slope is the sign of the correlation. For instance, the slope of the linear equation in the Circle Example is $1/\pi$ or $\pi$, which is positive, and, hence, the correlation of X and Y is positive (recall $\rho_{X,Y} = 1$). Namely, if X increases as Y increases (the same movements), then X and Y have a positive correlation, like X (the circumference of a circle) and Y (the diameter of the circle). On the other hand, if one variable decreases as another variable increases (opposite movements), then they have a negative correlation.

If we know the relation of two random variables (especially mathematically), like the circumference and the diameter of a circle, then whether they are correlated is obvious and straightforward. However, in practice, it is often not obvious or straightforward. For instance, the relation between the malleability of steel and the rate of the annealing temperature drop is not completely understood scientifically (and mathematically), and no clear mathematical equation (formula) exists for them, while they are known to have a certain relation.

Even if a linear relation is known for two random variables, it is not straightforward to use the linear relation in practice. For instance, suppose you need to measure the volume of the oranges that you use in your food processing business because your orange processing machine has a maximum volume for the oranges; it cannot process oranges which are too large. Measuring the volumes of many oranges is more complicated, more laborious, and more expensive than measuring their weights. The volume V and the weight W have a linear relation. In fact, the volume V is a linear function of the weight W given as

$$V = \frac{1}{d}\,W,$$

where d is the density of the material (oranges in this example). If the density d is constant for all the oranges that you receive from your vendors (suppliers), then the volume V is computed from the linear equation given above by weighing oranges (that is, from W). In practice, the problem is that, while it is close, the density is not constant for all oranges, since different trees produce oranges of different densities. Even one tree produces oranges of different densities. If the values of the density are close enough for the oranges that you use, then the correct decision is to measure orange weights, not orange volumes (so that a lot of time and money are saved, and the profit increases), and compute their volumes. If the values of the density are all over the place, then the correct decision is to stay with the volume measurement.

In order to make the correct decision (whichever it is), we take data and estimate the correlation between V and W by using the data. That is, in this Orange Size Example, the correlation (or, equivalently, the correlation coefficient) is the parameter, and its value must be estimated or tested by hypothesis testing to make the correct decision.

As volume measurements are taken, the weight measurements of the oranges are also taken. These volume and weight measurements form data in ordered pairs as (W, V). Each ordered pair corresponds to one orange, and its weight and volume are given in the first and second positions of the ordered pair, respectively. If data are collected from 300 oranges, then there should be 300 ordered pairs with a total of 600 measurements, which are 300 weight measurements and their corresponding 300 volume measurements. A weight measurement cannot be paired up with an arbitrary volume measurement. It must be paired up with the volume measurement from the same orange.

It should be noted that these ordered pairs can be written as (V, W), instead of (W, V), just as well, since we are investigating a linear relation of V and W. If V is a linear function of W, then W is a linear function of V, and vice versa.
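The decision rule in the Orange Size Example can be illustrated with simulated data. The sketch below is not real orange data: the weight range (150 to 250), the two density ranges, and the helper names `sample_corr` and `simulate_oranges` are all assumptions made for illustration. It uses the sample correlation coefficient that is defined later in this section, and it keeps each weight paired with the volume of the same simulated orange.

```python
import math
import random


def sample_corr(xs, ys):
    """Sample correlation: sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return numerator / ((n - 1) * s_x * s_y)


def simulate_oranges(n, d_low, d_high, rng):
    """Return n ordered pairs (W, V); both values come from the same orange."""
    pairs = []
    for _ in range(n):
        weight = rng.uniform(150.0, 250.0)      # hypothetical weight range
        density = rng.uniform(d_low, d_high)    # hypothetical density range
        pairs.append((weight, weight / density))  # V = W / d
    return pairs


rng = random.Random(0)
tight = simulate_oranges(300, 0.95, 1.00, rng)  # densities close together
loose = simulate_oranges(300, 0.50, 1.50, rng)  # densities all over the place

r_tight = sample_corr([w for w, _ in tight], [v for _, v in tight])
r_loose = sample_corr([w for w, _ in loose], [v for _, v in loose])
print(round(r_tight, 3), round(r_loose, 3))
```

Under these assumed ranges, the estimate for the nearly constant density should come out much closer to 1 than the estimate for the widely varying density, which is the situation in which weighing is a reasonable substitute for measuring volume.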
However, the order of the measurements (variables) must be consistent throughout the data. In other words, you cannot have some (V, W)'s and some (W, V)'s in the same data; they are either all (V, W)'s or all (W, V)'s.

Whether or not any correlation is investigated, this kind of data is called bivariate data. Generally, data for correlation means bivariate data. Bivariate data are data whose measurements (observations) come from two random variables in ordered pairs. The sample size n of bivariate data is the number of pairs in the data, not the total number of observations or measurements (which is 2n for bivariate data of size n). This is because, for bivariate data, each pair is considered to be a datum. The sample size is the size of the data, which is the number of data items, and, for bivariate data, it is the number of pairs. Besides, there were 300 oranges sampled, and a pair of measurements was then taken from each orange. The original meaning of sample size is the number of objects (oranges) sampled from a population.

These pairs are ordered pairs and points in a two-dimensional plane. Thus, they are often referred to as data points. The sample size n is the total number of data points in the data. This is applicable to multivariate data in general.

When a relation of two variables is investigated, a certain kind of graph is very commonly used: the scatter plot. A scatter plot consists of one horizontal axis (real number line) and one vertical axis (real number line) intersecting each other at their origins, just like the x-y plane, but the axes are not necessarily x- and y-axes. An ordered pair in data from two variables is a point in such a plane, with the first coordinate of the ordered pair on the horizontal axis and the second coordinate on the vertical axis. That is, ordered pairs are plotted in a scatter plot as points. All this should be familiar to you from the prerequisite.

For instance, if 30 circles are measured in the Circle Example and 30 ordered pairs (Y, W) are obtained as data, then they are plotted in a scatter plot as 30 points. This scatter plot has the Y-axis as the horizontal axis and the W-axis as the vertical axis. That is, this scatter plot is a Y-W plane with 30 points on it. These 30 points are all located on a parabola going through the origin, in the first quadrant since $Y \ge 0$ and $W \ge 0$.

If 30 ordered pairs (X, Y) are obtained as data in the Circle Example, then they are plotted in a scatter plot as 30 points. This scatter plot is an X-Y plane with 30 points all on a straight line, starting at the origin with a positive slope of $1/\pi$, in the first quadrant since $X \ge 0$ and $Y \ge 0$. These points must all be on the straight line, assuming no measurement error.

Generally, if points are all on (or closely clustered around) a straight line of positive slope in a scatter plot, the correlation of the two kinds of measurements is 1 or close to 1; that is, a perfect positive correlation or a strong positive correlation exists between the two variables. If points are all on (or closely clustered around) a straight line of negative slope in a scatter plot, the correlation of the two kinds of measurements is -1 or close to -1; that is, a perfect negative correlation or a strong negative correlation exists between the two variables. If points are all scattered around like shotgun pellets, then the correlation of the two kinds of measurements is 0 or close to 0; that is, no correlation or negligible correlation exists between the two variables. See the diagrams of scatter plots given below.

[Figure: scatter plots illustrating ρ = 1 and ρ = -1]
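The three scatter patterns described above can be imitated with simulated points. This is a minimal sketch under stated assumptions: the slopes 2 and -0.5, the intercepts, the sample size of 1000, and the helper name `sample_corr` are arbitrary choices for illustration, and the "shotgun" pattern is produced by drawing the two coordinates independently.

```python
import math
import random


def sample_corr(xs, ys):
    """Sample correlation: sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return numerator / ((n - 1) * s_x * s_y)


rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]

up = [2.0 * x + 1.0 for x in xs]              # all points on a line of positive slope
down = [-0.5 * x + 3.0 for x in xs]           # all points on a line of negative slope
cloud = [rng.uniform(0.0, 10.0) for _ in xs]  # second coordinate unrelated to the first

r_up = sample_corr(xs, up)
r_down = sample_corr(xs, down)
r_cloud = sample_corr(xs, cloud)
print(round(r_up, 3), round(r_down, 3), round(r_cloud, 3))
```

Note that the slopes 2 and -0.5 do not show up in the results; only their signs do, in line with the remark that the slope of the line is not the value of the correlation coefficient.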

[Figure: scatter plot illustrating ρ = 0]

These scatter plots were found at the following website, which is no longer available:
http://www.mste.uiuc.edu/courses/ci330ms/youtsey/scatterinfo.html
Please read Appendix: Scatter Plots given below for more about scatter plots.

If the 300 ordered pairs (W, V) are plotted in a scatter plot, the 300 points should be clustered along the graph of $V = W/d$ in the first quadrant, since $W \ge 0$ and $V \ge 0$. Note that these points do not spread out along the line filling out the first quadrant, since the sizes of oranges are relatively uniform; that is, the range of W is small. However, if you zoom in to where all the points are located, the points should cluster along a short piece of the line.

A scatter plot is useful for recognizing a relation (or lack of relation) between two variables. If the data of the steel malleability and the rate of the annealing temperature drop are plotted, a relation (which may not be linear) between them can be found. It might show no relation (a shotgun shot) between them. However, this would be a very important and useful piece of information, because it would mean that the rate of the annealing temperature drop has nothing to do with the steel malleability. That is, you could not control the steel malleability by the annealing temperature drop. You would have to find the other factors that have relations with steel malleability.

You can construct a scatter plot by using Excel or by going to the website
http://www.shodor.org/interactivate/activities/scatterplot/
You enter data as one column of measurements from one variable and another column of measurements from the other variable, separating the two measurements in each row by one blank (space). Do not enter bivariate data as points (that is, do not use parentheses). Try it out.

Even if a scatter plot shows tightly clustered points and indicates a significant correlation, the scatter plot does not estimate the value of the correlation $\rho$. While a scatter plot might suggest some correlation (or lack of it) visually, it does not provide a numerical estimate for $\rho$. When an estimate of $\rho$ is needed, the sample correlation coefficient is used. Its formula is

$$\hat{\rho}_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x\,s_y},$$

where n is the number of ordered pairs, $\bar{x}$ and $\bar{y}$ are the sample averages of the data from variables X and Y respectively, and $s_x$ and $s_y$ are the sample standard deviations of the data from variables X and Y respectively. Sometimes, r is used for $\hat{\rho}_{x,y}$.

To understand the formula of $\hat{\rho}_{x,y}$, read Appendix: Formula of $\hat{\rho}_{x,y}$ given below. It is very important to read and understand what is explained in the appendix. If you do, you understand the formula. You do not have to memorize it; you can recall it correctly and use it correctly.

Let us check a couple of points about $\hat{\rho}_{x,y}$, which helps you understand $\hat{\rho}_{x,y}$ and the formula. If Y = X, then $\rho_{x,y} = \rho_{x,x}$, which is 1. Let us see what happens to

$$\hat{\rho}_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x\,s_y}.$$

$$
\hat{\rho}_{x,y} = \hat{\rho}_{x,x}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{(n-1)\,s_x\,s_x}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{(n-1)\,(s_x)^2}
= \frac{1}{(s_x)^2}\cdot\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
= \frac{1}{(s_x)^2}\,(s_x)^2 = 1,
$$

since $\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1) = (s_x)^2$. That is, $\hat{\rho}_{x,x} = 1$, which makes sense and is as it should be since $\rho_{x,x} = 1$.

Now, if Y = -X, then $\rho_{x,y} = \rho_{x,-x}$, which is -1. Let us see what happens to $\hat{\rho}_{x,y}$. The sample average of the data from -X is $-\bar{x}$ and their sample standard deviation is $s_x$, so

$$
\hat{\rho}_{x,y} = \hat{\rho}_{x,-x}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(-x_i - (-\bar{x}))}{(n-1)\,s_x\,s_x}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(-1)(x_i - \bar{x})}{(n-1)\,(s_x)^2}
= (-1)\,\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{(n-1)\,(s_x)^2}
= (-1)\,\hat{\rho}_{x,x} = (-1)(1) = -1.
$$

That is, $\hat{\rho}_{x,-x} = -1$, which makes sense and is as it should be since $\rho_{x,-x} = -1$.

It is important to understand what we have just done since, if you understand it, you understand the formula of $\hat{\rho}_{x,y}$, and you can remember it and use it correctly.

Let us have a numerical example for the sample correlation coefficient with small data of n = 4, {(3, 8.5), (12, 2.3), (6, 6.4), (7, 4.8)}, which can be given in a table as

X     Y
3     8.5
12    2.3
6     6.4
7     4.8

To compute $\hat{\rho}_{x,y}$, let us find $s_x$ and $s_y$ first.

$\bar{x}$ = (3 + 12 + 6 + 7)/4 = 7 and $\bar{y}$ = (8.5 + 2.3 + 6.4 + 4.8)/4 = 5.5.

$(x - \bar{x})$    $(x - \bar{x})^2$    $(y - \bar{y})$         $(y - \bar{y})^2$
3 - 7 = -4         16                   8.5 - 5.5 = 3           9
12 - 7 = 5         25                   2.3 - 5.5 = -3.2        10.24
6 - 7 = -1         1                    6.4 - 5.5 = 0.9         0.81
7 - 7 = 0          0                    4.8 - 5.5 = -0.7        0.49
Total              42                   Total                   20.54

So, $s_x = \sqrt{42/(4-1)} = 3.74$ and $s_y = \sqrt{20.54/(4-1)} = 2.62$. Also,

$(x - \bar{x})(y - \bar{y})$
(3 - 7)(8.5 - 5.5) = -12
(12 - 7)(2.3 - 5.5) = -16
(6 - 7)(6.4 - 5.5) = -0.9
(7 - 7)(4.8 - 5.5) = 0
Total: -28.9
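The hand computation above can be checked step by step in code. The following is a small sketch that follows the same order as the tables (averages, sums of squared deviations, sum of products, then the standard deviations); the variable names are chosen here for illustration only.

```python
import math

# The four ordered pairs (x, y) from the numerical example.
data = [(3, 8.5), (12, 2.3), (6, 6.4), (7, 4.8)]
n = len(data)

x_bar = sum(x for x, _ in data) / n   # sample average of x
y_bar = sum(y for _, y in data) / n   # sample average of y

ss_x = sum((x - x_bar) ** 2 for x, _ in data)            # column total of (x - x_bar)^2
ss_y = sum((y - y_bar) ** 2 for _, y in data)            # column total of (y - y_bar)^2
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)  # total of the products

s_x = math.sqrt(ss_x / (n - 1))   # sample standard deviation of x
s_y = math.sqrt(ss_y / (n - 1))   # sample standard deviation of y

rho_hat = sp_xy / ((n - 1) * s_x * s_y)
print(round(s_x, 2), round(s_y, 2), round(rho_hat, 2))  # prints: 3.74 2.62 -0.98
```

Working with unrounded standard deviations gives an estimate of about -0.984, which rounds to the same -0.98 obtained by hand.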

So, the estimate for $\rho$ is

$$\hat{\rho}_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x\,s_y} = \frac{-28.9}{(4-1)(3.74)(2.62)} = -0.98.$$

This estimate indicates that the true value of $\rho$ is close to -1 and that the two variables X and Y have a strong negative correlation.

The data used as an example above are small for the sake of simplicity. In reality, large data (30 data points or more) should be used. To compute the sample correlation coefficient, the following websites can be used:
http://www.alcula.com/calculators/statistics/correlation-coefficient/
or
http://www.socscistatistics.com/tests/pearson/default2.aspx

Generally, a value of $\hat{\rho}_{x,y}$ greater than 0.9 indicates a strong positive correlation between the two variables, while a value of $\hat{\rho}_{x,y}$ less than -0.9 indicates a strong negative correlation between the two variables.

For large data, hypothesis testing can be conducted on the correlation between two variables, typically as

Ho: There is no correlation between the two variables.
vs
Ha: There is some correlation between the two variables.

Please read Appendix: Hypothesis Testing on Correlation given below.

I would take more than 100 pairs of data of volumes and weights in the Orange Size Example and conduct this hypothesis test at the 1% significance level. There are computer packages that do significance testing on correlation. With the p-value from the significance test, I can finish the hypothesis testing on the correlation between the orange volume (V) and the orange weight (W), as discussed in the last section of the last chapter.

If the null hypothesis is rejected, then I find the average density $\bar{d}$ of the oranges that I used for the hypothesis testing. Then, I start measuring their weights, instead of their volumes. The volume of an orange can be computed from its weight by $V = W/\bar{d}$. I would randomly select oranges (say, 1% on average) as they come in and still take volume measurements of them. This way, I can compute a running correlation between V and W so that I can monitor any changes in the correlation. At the same time, I can reduce the cost of measuring orange sizes and increase the profit.

It is important to know that some (or even a strong) correlation between two variables does not mean that there exists a cause-effect relation between them. Recall that a cause-effect relation is a relation of a set of variables (factors) causing some effect on another variable (the response variable). In fact, the variables causing an effect on the other variables are called factors, and the affected variables are called the response variables in experiments. Data come from response variables (measured or observed values of the response variable) in experiments. To find a cause-effect relation among variables, an experiment must be conducted. You can find a relation among variables, such as a correlation, by observational studies, but an experiment with randomization is necessary to establish a cause-effect relation.

A linear relation between two variables is also investigated by simple regression analysis. The difference between simple regression and correlation is that, in simple regression, one of the variables is a random variable but the other is a constant variable (as opposed to a random variable), while both variables are random variables for a correlation.

Finally, the sample covariance

$$\widehat{\mathrm{COV}}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

estimates the (population) covariance COV(X, Y). This parameter of covariance indicates the linear relation between two random variables, X and Y, just like the correlation coefficient. However, its values are not standardized from -1 to 1 like those of the correlation coefficient. Also, unlike the correlation coefficient, the (population) covariance has a physical dimension, (the dimension of X)*(the dimension of Y). This is the reason why the correlation coefficient $\rho_{X,Y}$ is considered to be a better parameter and is more widely used for indicating a linear relation between two random variables.

The sample covariance has the same drawbacks as the population covariance. Look at the formula given above. The absence of $s_x$ and $s_y$ from the denominator results in an estimate with a physical dimension of (the dimension of X)*(the dimension of Y). Also, the value of an estimate is not standardized between -1 and 1. Incidentally, these drawbacks make the sample covariance a good estimator of COV(X, Y), since it has these same drawbacks, but not a good indicator of the correlation between two variables.

Often, the term "coefficient," as in correlation coefficient, means standardized values (such as 0 to 1 or -1 to 1) without physical dimension. Coefficients are often used for measuring some property of objects in engineering and the sciences, such as the drag coefficient.

Appendix: Scatter Plots

Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose. Scatter plots show how much one variable is related to another. The relationship between two variables is called their correlation. Scatter plots usually consist of a large body of data. The closer the plotted data points come to making a straight line, the higher the correlation between the two variables is, or the stronger the linear relationship is.
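The remark that the covariance carries units while the correlation coefficient does not can be illustrated by changing the unit of one variable. The sketch below reuses the four data points of the numerical example and assumes, purely for illustration, that x is recorded in dollars and then converted to cents; the helper names `sample_cov` and `sample_corr` are introduced here and are not from the text.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance: sum of deviation products divided by (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_corr(xs, ys):
    """Sample correlation: covariance divided by s_x * s_y."""
    s_x = math.sqrt(sample_cov(xs, xs))  # covariance of x with itself is s_x squared
    s_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)


xs = [3, 12, 6, 7]               # treated as dollars (an assumption for this sketch)
ys = [8.5, 2.3, 6.4, 4.8]
xs_cents = [100 * x for x in xs]  # the same measurements expressed in cents

cov_dollars = sample_cov(xs, ys)
cov_cents = sample_cov(xs_cents, ys)
r_dollars = sample_corr(xs, ys)
r_cents = sample_corr(xs_cents, ys)
print(cov_dollars, cov_cents, r_dollars, r_cents)
```

The covariance grows by a factor of 100 with the change of unit, while the correlation coefficient stays the same, which is the sense in which the latter is standardized and dimensionless.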

If the data points make a straight line going from the origin out to high x- and y-values, then the variables are said to have a positive correlation. If the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation.

A perfect positive correlation is given the value of 1. A perfect negative correlation is given the value of -1. If there is absolutely no correlation present, the value given is 0. The closer the number is to 1 or -1, the stronger the correlation is, or the stronger the linear relationship between the variables is. The closer the number is to 0, the weaker the correlation. So something that seems to kind of correlate in a positive direction might have a value of 0.67, whereas something with an extremely weak negative correlation might have the value -0.21.

An example of a situation where you might find a perfect positive correlation, as we have in the graph on the left above, would be when you compare the total amount of money spent on tickets at the movie theater with the number of people who go. This means that every time that "x" number of people go, "y" amount of money is spent on tickets without variation.

An example of a situation where you might find a perfect negative correlation, as in the graph on the right above, would be if you were comparing the speed at which a car is going to the amount of time it takes to reach a destination. As the speed increases, the amount of time decreases.

On the other hand, a situation where you might find a strong but not perfect positive correlation would be if you examined the number of hours students spent studying for an exam versus the grade received. This would not be a perfect correlation because two people could spend the same amount of time studying and get different grades. However, in general, the rule will hold true that as the amount of time studying increases, so does the grade received.

Let us take a look at some examples. The graphs that are shown above both have perfect correlations, so their values are 1 and -1. The graphs below obviously do not have perfect correlations. Which graph would have a correlation of 0? What about 0.7? -0.7? 0.3? -0.3? Click on Answers when you think that you have them all matched up.

Everything given in this appendix is found at
http://www.mste.uiuc.edu/courses/ci330ms/youtsey/scatterinfo.html
Also note that "correlation" is used interchangeably with "correlation coefficient" in this appendix. However, correlation is a property of two variables, and the correlation coefficient is a measure of the property of correlation.

Appendix: Formula of $\hat{\rho}_{x,y}$

Let us understand the formula. The denominator of the formula is positive, since the three factors in the denominator are all positive. The numerator therefore determines the sign of $\hat{\rho}_{x,y}$, and it is the part that places the value of $\hat{\rho}_{x,y}$ between -1 and 1, just like $\rho$.

Another point about the denominator is that it makes $\hat{\rho}_{x,y}$ dimensionless (unitless), just like $\rho$. Suppose X is in dollars and Y is in pounds; then the data from X and $\bar{x}$ are in dollars, and the data from Y and $\bar{y}$ are in pounds. As a result, the numerator is in the dimension of dollars*pounds. In the denominator, (n - 1) has no physical dimension, but $s_x$ and $s_y$ have dimensions of dollars and pounds, respectively. So, the quotient of the formula for $\hat{\rho}_{x,y}$ has no physical dimension (unitless).

It is not only sensible but also important to use an estimator which produces an estimate with a value between -1 and 1 and with no dimension to estimate a parameter whose value is between -1 and 1 and dimensionless.

Generally, bivariate data are graphically represented as points in a rectangular coordinate system on a plane defined by a horizontal axis and a vertical axis. If the variables are X and Y, then the plane is the x-y plane that you learned in the prerequisites. If the correlation of X and Y is positively high (that is, $\rho_{x,y}$ is close to 1), then the scatter plot of the data (which are bivariate data) should show the points tightly clustered along a straight line with a positive slope.

Look at the numerator of the formula; $\bar{x}$ and $\bar{y}$ are subtracted from the data from X and the data from Y respectively, which are a horizontal shift and a vertical shift of all the points. They shift the point $(\bar{x}, \bar{y})$, which is the center of all the points, to the origin (0, 0). That is, all the points are tightly clustered along a straight line of positive slope going through the origin.

For instance, the following bivariate data have the center of the data at $\bar{x} = 7$ and $\bar{y} = 5.5$, which happens to be the third data point. Subtracting $\bar{x} = 7$ from the x-coordinates and $\bar{y} = 5.5$ from the y-coordinates of the points results in shifting the center of the data (7, 5.5) to the origin (0, 0).

(X, Y)       (X - $\bar{x}$, Y - $\bar{y}$)
(3, 8.5)     (-4, 3.0)
(6, 6.4)     (-1, 0.9)
(7, 5.5)     (0, 0)
(7, 4.8)     (0, -0.7)
(12, 2.3)    (5, -3.2)

Also, see the scatter plots of these data points given below. The first scatter diagram is for the original (X, Y)'s in the table.

[Figure: scatter plot titled "Example" of the original data; horizontal axis X, vertical axis Y]

The second scatter plot is for the (X - $\bar{x}$, Y - $\bar{y}$)'s in the table. Both scatter plots were produced by Excel.

[Figure: scatter plot titled "Example" of the shifted data; horizontal axis X - X bar, vertical axis Y - Y bar]

The points in the first scatter plot got shifted down and to the left by the subtraction of the averages, as shown in the second scatter plot. The center of the points is now at the origin in the second scatter plot. All the points, except for the one at the origin and one on the y-axis, are in the second or fourth quadrant in the second scatter plot. Each of these points has x- and y-coordinates of opposite signs. Thus, its product is negative, and the sum of all the products of these points results in a negative number, which makes sense since these bivariate data came from X and Y with $\rho_{X,Y}$ close to -1.
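The quadrant argument can be checked directly on the five data points of the table. The following is a sketch only; the helper name `quadrant` and the convention of returning 0 for a point lying on an axis are choices made here for illustration.

```python
from collections import Counter

# The five ordered pairs from the table in this appendix.
data = [(3, 8.5), (6, 6.4), (7, 5.5), (7, 4.8), (12, 2.3)]
n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Shift the center of the data to the origin.
shifted = [(x - x_bar, y - y_bar) for x, y in data]


def quadrant(dx, dy):
    """Return 1-4 for the quadrant of (dx, dy), or 0 if the point lies on an axis."""
    if dx == 0 or dy == 0:
        return 0
    if dx > 0:
        return 1 if dy > 0 else 4
    return 2 if dy > 0 else 3


counts = Counter(quadrant(dx, dy) for dx, dy in shifted)
products = [dx * dy for dx, dy in shifted]
numerator = sum(products)

print(dict(counts))
print(numerator)
```

The shifted points that do not lie on an axis fall in the second and fourth quadrants, so every nonzero product is negative and the numerator of the formula comes out negative.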

Now, suppose that all the shifted points $(x - \bar{x}, y - \bar{y})$ are in the first and third quadrants. In the first quadrant, $(x - \bar{x})$ and $(y - \bar{y})$ are both positive and, hence, the products $(x - \bar{x})(y - \bar{y})$ are all positive. In the third quadrant, $(x - \bar{x})$ and $(y - \bar{y})$ are both negative and, hence, the products $(x - \bar{x})(y - \bar{y})$ are all positive. So, in the numerator of the formula for $\hat{\rho}_{x,y}$, only positive numbers are added, which results in a positive number for the numerator. This results in a high positive number (close to 1) for $\hat{\rho}_{x,y}$, which makes sense since these data come from X and Y with $\rho_{X,Y}$ close to 1.

If the points are clustered around a straight line of positive slope but not tightly, then some points $(x - \bar{x}, y - \bar{y})$ are in the second and fourth quadrants, and the products $(x - \bar{x})(y - \bar{y})$ are negative in these quadrants. So, when all the products $(x - \bar{x})(y - \bar{y})$ are added up for the numerator, the sum is not as high a positive number. This results in a positive number closer to 0 (not close to 1) for $\hat{\rho}_{x,y}$, which makes sense since $\rho_{x,y}$ is not close to 1 and, hence, the points do not cluster tightly around a straight line.

In fact, if there is no correlation between X and Y (that is, $\rho_{x,y} = 0$), then the points can be all scattered somewhat uniformly in a circle with the center $(\bar{x}, \bar{y})$. After subtracting $\bar{x}$ and $\bar{y}$ from the measurements from X and the measurements from Y respectively, there are about equal numbers, n/4, of points $(x - \bar{x}, y - \bar{y})$ in each quadrant, which means about the same number of negative numbers and positive numbers are added in the numerator. Also, these points are scattered almost symmetrically about the x-axis and the y-axis. As a result, each negative product $(x - \bar{x})(y - \bar{y})$ has a corresponding positive product $(x - \bar{x})(y - \bar{y})$ close to it in absolute value. They cancel each other when added. This results in a value close to zero for the numerator of the formula and, consequently, in a value close to zero for $\hat{\rho}_{x,y}$, which makes sense since $\rho_{x,y} = 0$ is estimated.
Similarly, if $\rho_{X,Y}$ is very close to -1, then the points are tightly clustered around a straight line with a negative slope. All the shifted points $(x - \bar{x}, y - \bar{y})$ are in the second and fourth quadrants, so all the products $(x - \bar{x})(y - \bar{y})$ are negative. Only negative numbers are added in the numerator, which results in a value close to -1 for $\hat{\rho}_{x,y}$, which makes sense since $\rho_{X,Y}$ close to -1 is estimated. If $\rho_{X,Y}$ is not close to -1 but still negative, then some of the points $(x - \bar{x}, y - \bar{y})$ are in the first and third quadrants, resulting in a few positive products $(x - \bar{x})(y - \bar{y})$. This results in a negative value that is away from -1 and closer to 0 for $\hat{\rho}_{x,y}$, which again makes sense.

So, the numerator of the formula makes sense. In fact, to obtain the information from the points (which are data) as to how tightly they are clustered along a straight line (the degree of linearity between X and Y), the numerator must be as given in the formula. Try to come up with another way of obtaining the same information from data. It is very difficult to do. By the way, you now know exactly why we subtract $\bar{x}$ and $\bar{y}$ in the numerator. Can you write the reason out in one sentence? You already know the reason why we have $s_x$ and $s_y$ in the denominator. The factor n-1 in the denominator adjusts the value of the numerator to the size of the data, making it (almost) per data point, just like the n-1 in the formula of the sample standard deviation. Now, you understand $\hat{\rho}_{x,y}$ and the formula of $\hat{\rho}_{x,y}$, which also helps you understand $\rho_{X,Y}$.

Appendix: Hypothesis Testing on Correlation

You can do hypothesis testing on correlation by testing

Ho: There is no correlation between the two variables (ρ = 0).
vs
Ha: There is some correlation between the two variables (ρ ≠ 0).

This is a two-tailed test and is typically performed by finding the p-value of significance testing from the observed value of the test statistic. The test statistic is

$$t = \frac{\hat{\rho}_{x,y}\,\sqrt{n-2}}{\sqrt{1-\hat{\rho}_{x,y}^{2}}},$$

and the observed value is computed by substituting the value of the sample correlation coefficient (computed from bivariate data) for $\hat{\rho}_{x,y}$ and the number of pairs in the data for n. For instance, $\hat{\rho}_{x,y} = -0.9839499$ is computed from bivariate data which consist of four pairs. Then, the observed value is computed as

$$\frac{(-0.9839499)\,\sqrt{4-2}}{\sqrt{1-(-0.9839499)^{2}}} \approx -7.798.$$

Now, go to the following website, http://surfstat.anu.edu.au/surfstat-home/tables/t.php or http://sites.csn.edu/mgreenwich/stat/t.htm. Click the circle under the last distribution, input 2 under d.f. (the degrees of freedom, n - 2) and -7.798 under t value, and click on the button between t value and probability. The input form is not reproduced here, but it is there on the website, with fields labeled d.f., t value, and probability. Then, you should see the p-value of 0.0161 under probability.

However, the p-value can be obtained more easily by entering the data directly into the following website.

http://home.ubalt.edu/ntsbarsh/business-stat/otherapplets/correlation.htm

You should put your bivariate data horizontally (not vertically, as they are often given) to the right of X and to the right of Y in the grid. Then, click on the button "TEST FOR CORRELATION". You should find the sample correlation coefficient at "Correlation (X,Y)", the observed value at "Test-statistic" (unfortunately, it is incorrect) and the p-value for a one-tailed test at "The P-Value".

Let us have an exercise. Test Ho: $\rho_{X,Y} = 0$ vs Ha: $\rho_{X,Y} \neq 0$ at the 5% significance level based on the data

X     Y
3     8.5
12    2.3
6     6.4
7     4.8

For this, go to the website and put 3, 12, 6, 7 horizontally to the right of X (without commas, of course) and 8.5, 2.3, 6.4, 4.8 underneath, horizontally to the right of Y. Click on the button "TEST FOR CORRELATION". You should find -0.9839499 for the sample correlation coefficient, -2.40856 (which is incorrect) for the observed value, and 0.00801 for the p-value (which is correct). However, this p-value is for a one-tailed hypothesis test. The hypothesis testing on the correlation coefficient is two-tailed. Thus, the p-value is 0.01602 (= 0.00801 × 2). This p-value is greater than 1%, so the null hypothesis cannot be rejected at the 1% significance level. However, it is less than 5%, so the null hypothesis is rejected at the 5% level. That is, the data provide some evidence (at least at the 5% significance level) that supports correlation between X and Y.

Note that this website obtains the p-value from the standard Normal distribution, which relies on large data (CLT), so you should use it with data of more than 30 data points. Also, the sample correlation coefficient of -0.9839499 is very close to -1, but the null hypothesis cannot be rejected at the 1% significance level. This is because of the small sample size.
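The numbers in this exercise can be reproduced without a website. The sketch below uses only the Python standard library and assumes the test statistic $t = \hat{\rho}\sqrt{n-2}/\sqrt{1-\hat{\rho}^2}$ with n - 2 degrees of freedom; the helper `t_cdf_df2` is a hypothetical name and relies on the closed-form t distribution function that holds only for exactly 2 degrees of freedom (that is, n = 4). The last part is an observation, not something stated by the website or by these notes: the reported value -2.40856 coincides with the Fisher z-transform of the sample correlation, which would be consistent with a p-value taken from the standard Normal distribution.

```python
import math

# Data pairs from the exercise.
xs = [3, 12, 6, 7]
ys = [8.5, 2.3, 6.4, 4.8]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Algebraically the same as sxy / ((n - 1) * s_x * s_y).
r = sxy / math.sqrt(sxx * syy)

# Observed value of the t statistic, with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def t_cdf_df2(value):
    """P(T <= value) for Student's t with exactly 2 degrees of freedom."""
    return 0.5 + value / (2 * math.sqrt(value * value + 2))


p_two_tailed = 2 * t_cdf_df2(-abs(t))

# Assumption for comparison only: Fisher z-transform with a Normal p-value.
z = math.atanh(r) * math.sqrt(n - 3)
p_one_tailed_normal = 0.5 * math.erfc(abs(z) / math.sqrt(2))

print(f"r = {r:.7f}")
print(f"t = {t:.3f}, two-tailed p = {p_two_tailed:.4f}")
print(f"Fisher z = {z:.5f}, one-tailed Normal p = {p_one_tailed_normal:.5f}")
```

The t-based two-tailed p-value and twice the Normal-based one-tailed p-value agree to about three decimal places for these data, which is why both routes lead to the same decisions at the 5% and 1% levels here.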

Another note is that, on the website, you should have 0 at "Claimed Population's Correlation". This is the null value and is 0 in this section. However, you can test the hypotheses Ho: $\rho_{X,Y} = 0.5$ vs Ha: $\rho_{X,Y} \neq 0.5$ if interested. In this case, you need to set the claimed population's correlation to 0.5.

Several computer programmes (such as Excel) and websites compute values of the sample correlation coefficient (from the input data) but do not produce observed values of the test statistic or p-values. Also, if someone gives you only a value of the sample correlation coefficient along with the sample size (but not the original data) and asks you to do hypothesis testing, then how can the hypothesis testing be performed? There are websites which produce the p-value from a sample correlation coefficient value and its sample size. Here is one of them.

http://www.danielsoper.com/statcalc/calculator.aspx?id=44

Let us have an example. Someone asks you to conduct the hypothesis testing on the sample correlation coefficient of -0.984, which was computed from bivariate data of sample size four (four pairs or four data points). Then, go to the website and input -0.984 for "Correlation Value (r):", which takes only three places after the decimal, and 4 for "Sample Size:". You should get 0.016000 for the p-value at "Probability (Two-Tailed):". So, the null hypothesis is rejected at the 5% but not rejected at the 1% significance level.

Of course, one-tailed hypothesis tests and significance tests of the following forms can be conducted as described in the second and third sections of the last chapter.

The left-tailed test:
Ho: There is positive or no correlation between the two variables (ρ ≥ 0).
vs
Ha: There is negative correlation between the two variables (ρ < 0).

The right-tailed test:
Ho: There is negative or no correlation between the two variables (ρ ≤ 0).
vs
Ha: There is positive correlation between the two variables (ρ > 0).

From the last example of the sample correlation coefficient of -0.984 with the sample size of 4, the following are the significance tests in these one-tailed tests.

Ho: ρ ≥ 0 in the left-tailed test is rejected with a p-value of 0.008000.
Ho: ρ ≤ 0 in the right-tailed test is not rejected with a p-value of 0.992000.

Here is a good review on hypothesis testing. Can you conduct the hypothesis testing on these one-tailed tests, say at α = 0.01?

Copyrighted by Michael Greenwich, 03/2017.
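When only the sample correlation coefficient and the sample size are available, the left-tailed, right-tailed and two-tailed p-values can also be approximated directly. The following is a rough sketch, not the method used by any of the websites above: the function names are hypothetical, the test statistic is assumed to be $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with n - 2 degrees of freedom, and the t distribution function is approximated by Simpson's rule using only the standard library.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=20000):
    """Approximate P(T <= t): Simpson's rule on [0, |t|] plus symmetry."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # approximates P(0 <= T <= |t|)
    return 0.5 + area if t > 0 else 0.5 - area


def correlation_p_values(r, n):
    """p-values for tests on the correlation, given only r and n (n > 2)."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    left = t_cdf(t, df)   # left-tailed test, Ha: rho < 0
    right = 1 - left      # right-tailed test, Ha: rho > 0
    return {"t": t, "left": left, "right": right,
            "two_tailed": 2 * min(left, right)}


result = correlation_p_values(-0.984, 4)
for name, value in result.items():
    print(f"{name}: {value:.6f}")
```

For r = -0.984 and n = 4, this gives p-values of about 0.008 (left-tailed), 0.992 (right-tailed) and 0.016 (two-tailed), in line with the values quoted above.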