Correlation and Covariance


Tom Ilvento, FREC 9

What is Next? Correlation and Regression

Regression
We specify a dependent variable as a linear function of one or more independent variables, based on co-variance. Regression provides estimates of the relationship between the dependent variable and the independent variable(s) via an equation of a line. The estimates, called coefficients, can be based on a sample and can be tested via a hypothesis test or confidence interval.
[Figure: scatterplot with a fitted regression line; the coefficients of the fitted equation are not legible in the source.]

Correlation
A measure of association between two variables. Expressed as a linear relationship. Based on the co-variance: how two variables vary about their means together. Can be shown in a visual way via a scatterplot.
[Figure: bivariate fit scatterplot with its r value; the value is not legible in the source.]

Correlation and Regression
A focus on the variance:
$\sum (X_i - \bar{X})^2 = TSS$ (Total Sum of Squares Deviations)
$\frac{\sum (X_i - \bar{X})^2}{n - 1} = MS$ (Mean Squared Deviation)
A focus on the co-variance:
$Cov_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$
A focus on the equation of a line: Y = a + b*X, where a is the intercept and b is the slope.

Let's revisit the Variance
In statistics we are interested in how a variable varies about its mean. We represented this as the Variance, the Mean Squared Deviation.
$\sum (X_i - \bar{X})^2 = TSS$ (Total Sum of Squares Deviations)
$\frac{\sum (X_i - \bar{X})^2}{n - 1} = MS$ (Mean Squared Deviation)

Basics of Co-Variance
Let's start with a basic graph of a Y variable vs. an X variable. I will dissect the graph with the mean of X and the mean of Y, which gives four quadrants:
Quadrant I: above Y-mean, above X-mean
Quadrant II: above Y-mean, below X-mean
Quadrant III: below Y-mean, below X-mean
Quadrant IV: below Y-mean, above X-mean

The Co-Variance
The covariance looks at how two variables, X and Y, vary about their means together. We express it as an average, divided by n (not n-1).
$Cov_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$
$Cov_{XY} = \frac{SS_{XY}}{n}$
The covariance is a basic building block of correlation, regression, and the General Linear Model.

Basics of Co-Variance
Values that tend to fall in quadrants II and IV reflect negative covariance.
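The covariance definition can be checked by hand on a small data set. The following is a minimal sketch in Python, assuming made-up values for x and y (they are not from the slides) and hypothetical helper names `covariance` and `quadrant_counts`. It divides by n, as the slide states, and counts how many points land in each quadrant around the two means.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def covariance(x, y):
    """Average cross-product of deviations about the means (divides by n)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and the same length")
    x_bar, y_bar = mean(x), mean(y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / len(x)


def quadrant_counts(x, y):
    """Count points per quadrant; points lying exactly on a mean line are skipped."""
    x_bar, y_bar = mean(x), mean(y)
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
    for xi, yi in zip(x, y):
        dx, dy = xi - x_bar, yi - y_bar
        if dx == 0 or dy == 0:
            continue
        if dx > 0 and dy > 0:
            counts["I"] += 1
        elif dx < 0 and dy > 0:
            counts["II"] += 1
        elif dx < 0 and dy < 0:
            counts["III"] += 1
        else:
            counts["IV"] += 1
    return counts


# Illustrative values only (assumed, not taken from the slides).
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

cov_xy = covariance(x, y)
counts = quadrant_counts(x, y)
print(cov_xy, counts)
```

With these values the points that count fall in quadrants I and III, and the covariance is positive, which matches the quadrant reasoning in the slides.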

Basics of Co-Variance
Values that tend to fall in quadrants I and III reflect positive covariance.

States with the Smartest Kids
Here are distribution statistics for […]th and […]th grade math.
[JMP distribution output for the two math variables: quantiles, moments (mean, standard deviation, standard error of the mean, confidence limits for the mean, N, variance, skewness, kurtosis, CV), and stem-and-leaf plots. The numeric values are not legible in the source.]

States with the Smartest Kids
This is some data on […] states plus Washington D.C. and army bases overseas. The key variables are the percent of students in […] who scored at an advanced level or higher for […]th and […]th grade math, and […]th and […]th grade reading. Any thoughts on this data?

Smartest Kids Data: Covariance of […]th Math and […]th Math
Most of the data points fall into quadrants I and III. Positive co-variance. As the […]th grade percent increases, so does the […]th grade percent.
[Figure: bivariate fit scatterplot with the fitted mean, alongside the covariance matrix for the two math variables; values not legible in the source.]
The numbers on the diagonal are the variances: the covariance of a variable with itself is the variance.
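To make the point about the diagonal concrete, here is a small sketch that builds a covariance matrix for two columns. The column names `math_lower` and `math_upper` and their values are hypothetical stand-ins, not the Smartest Kids data, and the divisor is n to match the definition used in these slides.

```python
def covariance(x, y):
    """Average cross-product of deviations about the means (divides by n)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n


def covariance_matrix(columns):
    """Return a nested dict: matrix[a][b] is the covariance of columns a and b."""
    names = list(columns)
    return {a: {b: covariance(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical percentages, for illustration only.
data = {
    "math_lower": [2, 4, 6, 8],
    "math_upper": [1, 3, 2, 10],
}

matrix = covariance_matrix(data)
for name, row in matrix.items():
    print(name, row)
```

The diagonal entries are each column's variance (the covariance of a column with itself), and the two off-diagonal entries are equal because the matrix is symmetric.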

Shortcomings of Co-Variance
The covariance between two variables is a useful concept: it is the building block for regression and other multivariate techniques. But as a measure of association it has limits. It is symmetrical, which is not a bad thing. It is unbounded, with no known high or low. It is difficult to determine what the value […] represents - a lot? a little? just how much? It is expressed in awkward cross-product units.
[Figure: bivariate fit scatterplot with the fitted mean, alongside the covariance matrix; values not legible in the source.]

Smartest Kids Data
Most of the data points fall into quadrants I and III. Positive co-variance. As the […]th grade percent increases, so does the […]th grade percent.
[Covariance matrix repeated; values not legible in the source.]

Pearson Correlation Coefficient - r
The correlation coefficient (r) is the co-variance adjusted for the standard deviations of both variables. The adjustment is simple, and it makes r much easier to interpret.
$r = \frac{Cov_{XY}}{s_X s_Y}$
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$
$r = \frac{SS_{XY}}{\sqrt{SS_X \, SS_Y}}$

Properties of r
Correlation Coefficient r:
Based on a linear measure of association.
Bounded between -1 and 1.
Symmetrical relationship: $r_{XY} = r_{YX}$.
Easier to interpret.
Invariant to linear scaling: adding or subtracting a constant, or multiplying or dividing by a positive constant, does not change the value of r between two variables.
Example: The correlation between the respondent's education and income does not change if you express income in total dollars or per $[…].
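The sum-of-squares form of r and its properties can be sketched in a few lines of Python. The education and income figures below are invented for illustration, and the rescaling to thousands of dollars is an assumed unit, since the unit in the slide's example is not legible. The helper name `pearson_r` is also an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed as SS_XY / sqrt(SS_X * SS_Y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return ss_xy / math.sqrt(ss_x * ss_y)


# Invented values: years of education and income in dollars.
education = [10, 12, 12, 14, 16, 18]
income = [20000, 26000, 31000, 30000, 42000, 50000]

# Same income expressed in an assumed unit of thousands of dollars.
income_thousands = [value / 1000 for value in income]

r_dollars = pearson_r(education, income)
r_thousands = pearson_r(education, income_thousands)
r_swapped = pearson_r(income, education)

print(r_dollars, r_thousands, r_swapped)
```

All three printed values agree up to floating-point rounding: rescaling income by a positive constant leaves r unchanged, and swapping the roles of the two variables gives the same r.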

Interpretation of r
The closer the correlation is to 1, the more perfect the positive linear relationship. If r = 1 then all values would fall on a straight line with an upward slope.
The closer the correlation is to -1, the more perfect the negative linear relationship. If r = -1 then all values would fall on a straight line with a downward slope.
The scatterplot is a visual depiction of the correlation coefficient.

Interpretation of r
r = 0 means no linear relationship.
[Figure: Non-Linear Relationship with a Near-Zero r]

Scatter Plot of Crab Force by Height
The correlation is […], a strong positive correlation.
[Figure: bivariate fit scatterplot of crab force by height]

Interpretation of r
One other interesting interpretation of r: the square of r is equal to R-square, a measure of association in Regression. This holds only in the case of a bivariate regression, with one independent variable. And it moves us toward defining one variable as explaining the other. This means that r² can be interpreted as the percent of variability in one variable that is explained by the other variable.
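Two of these interpretations can be checked numerically. The sketch below uses invented data and hypothetical helper names: a curved relationship (y = x² on values symmetric about zero) gives r = 0 even though y is completely determined by x, and for a one-predictor least-squares line the regression R-square equals r².

```python
import math


def sums_of_squares(x, y):
    """Return the means and the sums of squares/cross-products for x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return x_bar, y_bar, ss_xy, ss_x, ss_y


def pearson_r(x, y):
    _, _, ss_xy, ss_x, ss_y = sums_of_squares(x, y)
    return ss_xy / math.sqrt(ss_x * ss_y)


def regression_r_squared(x, y):
    """R-square of the least-squares line y = a + b*x (one independent variable)."""
    x_bar, y_bar, ss_xy, ss_x, ss_y = sums_of_squares(x, y)
    slope = ss_xy / ss_x
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - sse / ss_y


# A perfectly determined but non-linear relationship.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [value ** 2 for value in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# A roughly linear relationship (invented values).
line_x = [1, 2, 3, 4, 5]
line_y = [1, 3, 2, 5, 4]
r_line = pearson_r(line_x, line_y)
r_square = regression_r_squared(line_x, line_y)

print(r_curve, r_line, r_line ** 2, r_square)
```

A near-zero r therefore rules out a linear pattern only, which is why the scatterplot should be inspected alongside the coefficient.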

Interpreting a correlation coefficient: Rules of Thumb for Narratives
The following is a table giving guidelines for narratives involving correlations. For simplicity's sake, the table is based on the absolute value of the correlation (|r|), and the exact description depends upon the subject and discipline.
[Table with columns Correlation Range, Percent Variability Explained (r²), and Description. In order of increasing |r| the descriptions are Weak, Moderate, Moderately Strong, and Strong. The numeric ranges are not legible in the source.]

Some pointers on correlation and covariance
Correlation and co-variance require the number of observations to be the same for all variables; cases with missing values are excluded. With Excel, this is even more of a problem.
Try to put the variable you are most interested in (i.e., the Dependent Variable) in the first column. Then you read down the first column to find the relationship of the dependent variable with the other variables. Reading the correlation between other variables requires you to move across rows and down columns.
To get the covariance and correlation in Excel it is easy:
Tools > Data Analysis > Covariance or Correlation
Input Range (click to the right and grab the data, in all four columns including labels)
Grouped by Columns
Labels in first row (Yes)
In JMP you need to go to Multivariate Methods > Multivariate, list the variables, and click the Hot Point to get correlations or covariances.

Omni-Bar Study
You are the marketing manager for OmniFoods and you are planning a nation-wide introduction of an energy bar, OmniPower. The bar was first marketed to high-end athletes and mountain climbers, but now is more popular with the general public. The company wants to test market the bars and determine the effect of price and in-store promotions on the sales of the bars. They design a study and test OmniPower in a sample of […] stores in a supermarket chain.
The dependent variable is Sales in dollars. The independent variables are price and promotion. Whole values have been carefully chosen for the study.
Price in three levels: $[…], $[…], and $[…]
Promotion in store in three levels: $[…], $[…], $[…]
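The pointer about missing values amounts to keeping only the cases that are complete on every variable before computing anything. The following is a rough sketch of that step in plain Python, with invented rows and a hypothetical helper name `complete_cases`; it is not a description of what Excel or JMP do internally.

```python
def complete_cases(rows):
    """Keep only the rows in which every variable has a value (no None)."""
    return [row for row in rows if all(value is not None for value in row)]


# Invented rows of (sales, price, promotion); None marks a missing value.
rows = [
    (4100, 0.5, 200),
    (3800, None, 200),
    (3300, 0.75, 400),
    (None, 1.0, 600),
    (2900, 1.0, 400),
]

usable = complete_cases(rows)

# Split the remaining rows back into columns of equal length.
sales, price, promotion = (list(column) for column in zip(*usable))

print(len(rows), len(usable))
print(sales, price, promotion)
```

After this step every column has the same number of observations, which is what the covariance and correlation formulas require.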

A closer look at sales
[JMP distribution output for sales: quantiles, moments (mean, standard deviation, standard error of the mean, confidence limits for the mean, N, variance, skewness, kurtosis, CV), and a stem-and-leaf plot. The numeric values are not legible in the source.]
Mean level of sales is $[…]. The median is considerably higher at $[…]. There is a fair amount of spread in the data: the CV is […] and the Std. Dev is $[…].

Covariance and Correlation
Note: Covariance is difficult to interpret. Correlations are relatively straightforward. There is little correlation between Price and Promotion.
[Covariance matrix and correlation matrix for sales, PRICE, and PROMOTION; values not legible in the source.]
The correlations are estimated by REML method.

Let's look at the relationship of Sales with Price and Sales with Promotion
Sales by Price has a downward sloping relationship. As Price goes up, sales go down - negative covariance. It looks linear and moderately strong.
Sales by Promotion has an upward sloping relationship. As Promotion goes up, sales go up - positive covariance. It looks linear and moderately strong.
[Figures: bivariate fit of sales by PRICE; bivariate fit of sales by PROMOTION]

It is an Easy step to Regression
[Figure: bivariate fit of sales by PRICE with a fitted line]

Real Life Correlation Example
Client: Nicholas Hidell, Quip Laboratories
The company had two ways to measure how clean the labs were: CFU and RLU. One was more expensive and preferred by the company. The other was cheaper and preferred by the client. They wanted to show the client that the two measures were not the same.

Let's look at an example of correlation and covariance
[JMP distribution output for RATING, SALARY, and YEARS (quantiles and moments) and a frequency table for ORIGIN with levels Inside Company and Outside Company. The numeric values are not legible in the source.]

Let's look at an example of correlation and covariance
The following is some data about mid-level managers in a company. The variables are:
RATING, a rating scale of the managers from […] to […];
SALARY, the salary of the manager in $[…];
YEARS, years of service at the company;
ORIGIN, a dummy variable indicating whether they were promoted inside the company (coded as […]) or were recruited from outside the company (coded as […]).
At this point we won't worry about a dependent or independent variable.

The Covariance Matrix for Manager Ratings Data
The covariance matrix has the variances on the diagonal (population variance based on n) and the co-variances on the off-diagonal. It is a symmetric matrix.
[Covariance matrix for RATING, SALARY, YEARS, and ORIGIN; values not legible in the source.]

The Correlation Matrix for Manager Ratings Data
The covariances are standardized to lie between -1 and 1. The diagonal is now 1: a variable is perfectly correlated with itself. It is a symmetrical matrix.
[Correlation matrix for RATING, SALARY, YEARS, and ORIGIN; values not legible in the source.]

Interpretation of Manager Ratings Data
There is a moderately strong positive relationship between SALARY and RATING: those that get higher salaries tend to have higher ratings. There is almost no relationship between YEARS in the company and the RATING (r = […]), but there is a negative relationship between YEARS and SALARY.
[Figure: bivariate fit of SALARY by RATING, shown with the correlation matrix repeated]
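The step from a covariance matrix to a correlation matrix is the same standardization applied to every pair: each covariance is divided by the square roots of the two matching diagonal entries. The sketch below uses a small made-up covariance matrix (not the manager ratings data) and a hypothetical helper name `correlation_from_covariance`.

```python
import math


def correlation_from_covariance(cov):
    """Convert a square covariance matrix (list of lists) to a correlation matrix."""
    size = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(size)]
    return [
        [cov[i][j] / (std[i] * std[j]) for j in range(size)]
        for i in range(size)
    ]


# Made-up symmetric covariance matrix for three variables.
cov = [
    [4.0, 2.0, -1.0],
    [2.0, 9.0, 0.0],
    [-1.0, 0.0, 1.0],
]

corr = correlation_from_covariance(cov)
for row in corr:
    print([round(value, 3) for value in row])
```

The diagonal comes out as 1, the matrix stays symmetric, and the signs of the off-diagonal entries match the signs of the covariances. The result is the same whether the covariances were computed with n or n-1, as long as one divisor is used throughout, because the divisor cancels.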