Statistics. Class 7: Covariance, correlation and regression

Similar documents
This module focuses on the logic of ANOVA with special attention given to variance components and the relationship between ANOVA and regression.

s e, which is large when errors are large and small Linear regression model

SIMPLE REGRESSION ANALYSIS. Business Statistics

Formula for the t-test

WISE Regression/Correlation Interactive Lab. Introduction to the WISE Correlation/Regression Applet

Reminder: Univariate Data. Bivariate Data. Example: Puppy Weights. You weigh the pups and get these results: 2.5, 3.5, 3.3, 3.1, 2.6, 3.6, 2.

Quantitative Techniques - Lecture 8: Estimation

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras

Chapter 12 - Lecture 2 Inferences about regression coefficient

Midterm 2 - Solutions

Midterm 2 - Solutions

1.) Fit the full model, i.e., allow for separate regression lines (different slopes and intercepts) for each species

Six Sigma Black Belt Study Guides

Final Exam - Solutions

Ecn Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman. Midterm 2. Name: ID Number: Section:

Objectives Simple linear regression. Statistical model for linear regression. Estimating the regression parameters

1 A Review of Correlation and Regression

Inference with Simple Regression

STATS DOESN T SUCK! ~ CHAPTER 16

Business Statistics. Lecture 10: Correlation and Linear Regression

REVIEW 8/2/2017 陈芳华东师大英语系

Final Exam - Solutions

Business Statistics. Lecture 9: Simple Regression

Inferences for Regression

Section 3: Simple Linear Regression

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression.

Linear Equations. Find the domain and the range of the following set. {(4,5), (7,8), (-1,3), (3,3), (2,-3)}

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat).

Multiple Linear Regression

Taguchi Method and Robust Design: Tutorial and Guideline

Predicted Y Scores. The symbol stands for a predicted Y score

9 Correlation and Regression

Section 4.6 Negative Exponents

Mathematics for Economics MA course

If I see your phone, I wil take it!!! No food or drinks (except for water) are al owed in my room.

Inference for Regression Simple Linear Regression

MATH 1150 Chapter 2 Notation and Terminology

Chapter 12 - Part I: Correlation Analysis

Q1: What is the interpretation of the number 4.1? A: There were 4.1 million visits to ER by people 85 and older, Q2: What percent of people 65-74

Univariate analysis. Simple and Multiple Regression. Univariate analysis. Simple Regression How best to summarise the data?

t. y = x x R² =

Analysing data: regression and correlation S6 and S7

Interactions. Interactions. Lectures 1 & 2. Linear Relationships. y = a + bx. Slope. Intercept

Regression Analysis IV... More MLR and Model Building

Correlation Analysis

Inferences for Correlation

df=degrees of freedom = n - 1

ECO220Y Simple Regression: Testing the Slope

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

Problem Set 1 ANSWERS

Psych 10 / Stats 60, Practice Problem Set 10 (Week 10 Material), Solutions

Review of Multiple Regression

General linear models. One and Two-way ANOVA in SPSS Repeated measures ANOVA Multiple linear regression

FREC 608 Guided Exercise 9

Ordinary Least Squares Regression Explained: Vartanian

Ch Inference for Linear Regression

Linear Regression. Chapter 3

Multiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company

Business Statistics. Chapter 14 Introduction to Linear Regression and Correlation Analysis QMIS 220. Dr. Mohammad Zainal

STEP 1: Ask Do I know the SLOPE of the line? (Notice how it s needed for both!) YES! NO! But, I have two NO! But, my line is

1/11/2011. Chapter 4: Variability. Overview

Ordinary Least Squares Regression Explained: Vartanian

Applied Regression Analysis

Lecture 8 CORRELATION AND LINEAR REGRESSION

Graphing Linear Equations: Warm Up: Brainstorm what you know about Graphing Lines: (Try to fill the whole page) Graphing

Estadística II Chapter 4: Simple linear regression

This gives us an upper and lower bound that capture our population mean.

Correlation. A statistics method to measure the relationship between two variables. Three characteristics

III. Inferential Tools

Describing the Relationship between Two Variables

As always, show your work and follow the HW format. You may use Excel, but must show sample calculations.

Data Analysis and Statistical Methods Statistics 651

4.1 Introduction. 4.2 The Scatter Diagram. Chapter 4 Linear Correlation and Regression Analysis

Competing Models, the Crater Lab

Lecture 14. Analysis of Variance * Correlation and Regression. The McGraw-Hill Companies, Inc., 2000

Lecture 14. Outline. Outline. Analysis of Variance * Correlation and Regression Analysis of Variance (ANOVA)

A discussion on multiple regression models

Section 1.6. Functions

Stat 411/511 ESTIMATING THE SLOPE AND INTERCEPT. Charlotte Wickham. stat511.cwick.co.nz. Nov

Notes 11: OLS Theorems ECO 231W - Undergraduate Econometrics

Section 2.5 from Precalculus was developed by OpenStax College, licensed by Rice University, and is available on the Connexions website.

Business Statistics. Lecture 10: Course Review

Section 11: Quantitative analyses: Linear relationships among variables

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box.

Lecture 10 Multiple Linear Regression

Block 3. Introduction to Regression Analysis

JUST THE MATHS UNIT NUMBER 5.3. GEOMETRY 3 (Straight line laws) A.J.Hobson

Chapter 4 Data with Two Variables

q3_3 MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

1.3 Exponential Functions

1 Correlation and Inference from Regression

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc.

Copyright, Nick E. Nolfi MPM1D9 Unit 6 Statistics (Data Analysis) STA-1

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

Correlation and Regression

Simple Linear Regression

STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours

Scatterplots and Correlation

Can you tell the relationship between students SAT scores and their college grades?

Transcription:

Class 7: Covariance, correlation and regression

Objective We saw in the last class how seats in parliament is approximately linearly related to population. Now we want a measure of how close a relationship is to linear. Then, we want to fit the best line possible to the data. Finally, we want to check whether the fit is reasonable.

Covariance The covariance is designed as a measure of whether the relationship between two variables is positive (generally increasing) or negative (generally decreasing). For data (x 1,y 1 ),, (x N, y N ), the covariance is: σ xy = 1 N x 1 xҧ y 1 തy + + (x N x)(y ҧ N തy) The quasi covariance is: s xy = 1 N 1 x 1 xҧ y 1 തy + + (x N x)(y ҧ N തy)

Covariance For an increasing relationship, most data are in the top right and bottom left regions so the covariance is positive. For a decreasing relationship, the covariance will be negative.

Covariance The covariance is σ xy 5.2 millions. Does this mean we have a very strong relationship? What would happen if we measured population in millions?

Covariance The relationship between population and seats in parliament looks the same but the covariance is now σ xy 5.2. We need an alternative measure which doesn t depend on the units of the data.

Correlation The correlation is r xy = σ xy σ x σ x = s xy s x s y. This is the covariance divided by the product of the standard deviations. Does this depend on the units? r xy = 0.995.

Correlation The correlation is r xy = σ xy σ x σ x = s xy s x s y. This is the covariance divided by the product of the standard deviations. The correlation is still the same. r xy = 0.995. How do we interpret this number?

Properties of the correlation coefficient 1 r xy 1. If r xy = 1, then x and y follow an exact, increasing or positive, linear relationship. If r xy = 1, then x and y follow an exact, decreasing, or negative linear relationship. If there is no relation between the two variables, then r xy = 0. The closer the correlation is to 1 (or -1), the closer the data are to a line. r xy = 0.995. The data are very close to a line.

Examples: different levels of correlation As the correlation gets closer to zero, the data are more spread out from a line.

Example: zero correlation doesn t mean no relationship Overall, the relationship is neither increasing nor decreasing.

Example: high correlation doesn t mean a line fit always looks good Overall, the points are close to a line, but these data are just the right hand side of the previous slide. Before trying to fit a regression line, we should always look at the data to see if this is appropriate.

Regression Often we want to predict the values of y in terms of x. If Spain decides to annex the district of Porto and turn this into a Spanish province, how many seats in parliament should it be assigned? If we fit a line y = α + β x to the data we have, then we can make the prediction by letting x be the population of Porto and calculating y.

Which regression line should we choose? We could fit many different lines to the data. We need to compare how well each line fits and select the best one.

Choosing a regression line For the data we have: (x 1,y 1 ),, (x N, y N ), if we fit a line y = a + b x, then the errors or residuals are r 1 = y 1 - (a+bx 1 ),, r N = y N - (a+bx N ), Here we have some of the residuals for the Green line. The residuals for the red line are much smaller.

Choosing a regression line We want to make the overall error as small as posible. We could try to choose a line so that the sum of the residuals is equal to zero. The red line and the yellow line at the mean value of Seats satisfy this. The problem is some residuals are negative and some positive.

Least squares We could also choose to minimize the sum of the absolute values of the errors. (Bad statistical properties) Instead we choose the line to minimize the sum of squared errors: r 12 + + r N2. Does this idea remind you of a measure of error we have seen before? The red line is the least squares fit..

Excel output SUMMARY OUTPUT REGRESSION STATISTICS Multiple R 0.995697051 R Square 0.991412617 Adjusted R Square 0.991240869 Standard Error 0.572180633 Observations 52 ANOVA df SS MS F Significance F Regression 1 1889.861235 1889.861235 5772.495575 2.50383E-53 Residuals 50 16.36953386 0.327390677 Total 51 1906.230769 Coefficients Standard error t Stat P-value Lower 95% Upper 95% Intercept 1.820683116 0.102335209 17.79136561 2.83304E-23 1.6151368 2.02622943 Population 6.99172E-06 9.20243E-08 75.97694108 2.50383E-53 6.80689E-06 7.1766E-06 This is scientific notation. E-06 means move the decimal point 6 places to the left. The line is Seats = 1.821 + 0.00000699 x Population

Prediction The line is: Seats = 1.821 + 0.00000699 x Population. The district of Porto has a population of 1,781,826. Therefore we would estimate that Porto should have approximately 1.821 + 0.00000699 x 1,781,826 = 14.28 seats in parliament

Prediction What if we wanted to make Perejil into a province? No-one lives there, except a few goats! Using the line: 1.821 + 0.00000699 x 0 = 1.821 seats in parliament. This doesn t seem reasonable! Predictions tend to be good within the data range and get worse the further away we are.

How well is our regression doing: the coefficient of determination REGRESSION STATISTICS Multiple R 0.995697051 R Square 0.991412617 Adjusted R Square 0.991240869 Standard Error 0.572180633 Observations 52 The coefficient of determination is 99.1%: the correlation squared x 100% This measures how much better our predictions are when we use regression line than when we use the global mean.

How well is our regression doing: residuals It is sensible to graph the residuals for our fitted regression. If everything is ok, the residuals should look fairly random. What do you think?

How well is our regression doing: residuals Remember the curve data. The line is quite close to the points. You can see a clear pattern in the residuals. The simple regression is not a good fit.

Exercise The following graphic compares levels of well-being (according to the Better Life Index) and wealth (according to GDP per person) in a number of different countries. In this case: a) The correlation between well-being and wealth is positive and the interquartile range of the wealth data is approximately 30. b) The correlation between well-being and wealth is zero and the range of the well-being data is approximately 3. c) The correlation between well-being and wealth is negative and the range of wealth data is approximately 30. d) None of the above.

Exercise The following graphic compares levels of well-being (according to the 2017 Sustainable Economic Development Assessment or SEDA score) and happiness level (according to the 2017 World Happiness Report) in a number of different countries. In this case: a) The correlation between the SEDA score and the happiness level is positive and close to 1 and the intercept of the regression line relating happiness to SEDA score is negative. b) The covariance between the SEDA score and the happiness level is positive and the standard deviation of the happiness level is bigger than 4. c) The interquartile range of the SEDA scores is bigger than 90 and the slope of the regression line relating happiness to the SEDA score is positive. d) None of the above.

Exercise The graph below is taken from The of Democide, by R.J. Rummell. Mark the correct response: a) The correlation between Total Power and War Dead is positive and close to 1. b) The correlation is exactly zero. c) The correlation is negative. d) The correlation is positive but close to zero.

Exercise The following graph relates the 2016 World Freedom Index (WFI) to the 2016 Economic Freedom Index (EFI) for alphabetically chosen countries of the world from A to C. The Ivory Coast (Côte d Ivoire) was included in A-C of the EFI with value 60 but not in the WFI (where it was classified as beginning with I). Use the regression line to estimate its WFI value. Do you think the estimate is reasonable? Why?