MATH 1070 Introductory Statistics Lecture notes Relationships: Correlation and Simple Regression


Objectives:

1. Learn the concepts of independent and dependent variables
2. Learn the concept of a scatterplot
3. Learn the interpretation of Pearson's correlation coefficient
4. Learn the equation for simple regression
5. Compute the regression equation for a given data set

Explanatory and Response Variables

We use relationships between variables to explain observed phenomena. Does smoking cause cancer, for example? What is the effect of lower temperatures on bacterial growth? We use this information to explain the world around us.

When investigating the relationship between variables, we have to distinguish which variable is causative and which is reactive. We usually think of one variable as causing some effect on the other. The variable that we think is causing the effect is called an explanatory variable, also known as an independent variable. The variable being affected is called a response variable or dependent variable. Here again we like to use symbols to represent concepts: we use the letter x to represent the independent variable(s) and the letter y to represent the dependent variable(s).

Scatterplots

As we saw previously, graphs of various types present information in the least painful way possible. There is a special type of graph for investigating relationships: the scatterplot, also known as an x-y plot or a dot plot. To create the scatterplot, put the explanatory (independent) variable along the x-axis and the response (dependent) variable along the y-axis, then place a dot (or other character) at the coordinates given by the values of the two variables for each observation.

What are we looking for in this graph?

- Overall pattern
- Striking deviations from the pattern
- Form
- Direction
- Strength

The direction of the points is important, as it tells us something about the relationship between the variables. When above-average values of one variable tend to accompany above-average values of the other, we say the variables have a positive association. When above-average values accompany below-average values, we say the variables have a negative association. This is evidenced by whether the points trend upward or downward. The strength of the relationship is shown by how closely the points on the plot follow a given form: the closer to a linear form, the stronger the relationship.
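To make the construction concrete, here is a minimal sketch of a scatterplot in Python with matplotlib. The data values are hypothetical, invented only to echo the house-size example used later in these notes.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration: house size (explanatory)
# and sale price (response), as in the example later in these notes.
square_feet = [1400, 1800, 2100, 2400, 3000]   # x: explanatory variable
sale_price = [205, 248, 270, 312, 395]         # y: response, in $1000s

plt.scatter(square_feet, sale_price)           # one dot per (x, y) pair
plt.xlabel("Square footage (explanatory, x)")
plt.ylabel("Sale price in $1000s (response, y)")
plt.title("Scatterplot of sale price vs. square footage")
plt.show()
```

Reading the plot, these invented points trend upward, so they would show a positive association.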

Correlation

While a scatterplot is a nice graphical representation, we would prefer something more compact and easier to use. The scatterplot is also open to too much interpretation as to the strength and form of the relationship. The statistic we use to measure the relationship between two variables was developed by Karl Pearson, an English scholar and mathematician, in the late 1800s. The statistic is named Pearson's r in his honor.

The computation of r involves the use of standardized values for both the x and y variables. The products of the standardized values for each pair of observations are summed, and the sum is divided by the number of observations minus one:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

It is very important that you do not sort your data! Sorting the values of the two variables separately destroys the pairing, and with it the relationship.

This value has some interesting and useful characteristics. It is always between -1.0 and 1.0, never outside this range. Values close to zero indicate weak, or nonexistent, relationships. Values close to the extremes (-1.0, 1.0) indicate strong relationships. If the value is exactly 1.0 (or -1.0), we say it is a perfect linear relationship. Positive values of r indicate a positive relationship; negative values indicate a negative relationship.

Correlation has no units: since it is based on standardized variables, the units have been removed. Correlation also only works for linear relationships between quantitative variables. If the scatterplot does not show some linearity, correlation will not work. Also, we cannot compute a correlation between qualitative, or categorical, variables.

Strength of the Relationship

    r value        Interpretation
    0.00 - 0.20    Slight; almost negligible relationship
    0.20 - 0.40    Low correlation; definite but small relationship
    0.40 - 0.70    Moderate correlation; substantial relationship
    0.70 - 0.90    High correlation; marked relationship
    0.90 - 1.00    Very high correlation; very dependable relationship

From J.P. Guilford, as cited in Basic Statistical Analysis, Sprinthall.

Caution: Association is NOT Causation

Beware the lurking variable! It is tempting to interpret the correlation of two variables as implying a causative relationship when there is not one. In one example, a study found a correlation between the number of televisions in a country and an increase in life expectancy. This does not mean that more televisions lead to longer life, does it? There could be some other explanation that includes both phenomena. In this case there probably is: economic standard of living. A higher standard of living allows people to own television sets, and probably to have electricity to run them. A higher standard of living also provides better health care, which leads to higher life expectancy.

The standard of living is the true explanatory variable in this case, not the number of televisions owned. This is what we call a lurking variable: one that is not studied but has an effect on the relationship.
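Before moving on to regression, here is a minimal sketch of the r formula above in Python. The helper name pearson_r is hypothetical, and the sample standard deviations use the n - 1 divisor to match the formula; the four data pairs are the small example dataset used in the regression section below.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: the sum of products of standardized values,
    divided by n - 1, as in the formula above."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Sample standard deviations (n - 1 divisor)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Average product of standardized values, keeping pairs together
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

# Note that the pairing matters: sorting x and y separately destroys it.
print(pearson_r([4, 9, 1, 6], [6, 10, 2, 2]))  # about 0.724
```

By the table above, an r of about 0.724 would be read as a high correlation, a marked relationship.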

Simple Linear Regression

Once we've established there is a relationship, we'd like to make use of this knowledge. Specifically, we'd like to be able to use this relationship to predict the behavior of the response variable. Having established there is a relationship between square footage and sale price of a house, it would be useful to a prospective buyer to know what a house might list for given its size. To do this we use a technique known as linear regression. The name gives a hint as to what type of relationship must exist (linear) for this to work. We will use the equation of a line, fitted to the data, as a predictive equation. Once we know what the line is, we can plug in new values for the explanatory (independent) variable, turn the crank, and get the predicted output. Simple. But how do we identify the line?

The first step in the process is to produce a scatterplot of the variables. (Note: this only illustrates the simple case, where one explanatory variable is assumed to adequately explain all the behavior of a single response variable.) Once we have the data points plotted, we can fit a line to them. For any given set of data points there are an infinite number of lines that can be fit. The trick is to find the one that fits best. To illustrate this we will use a small dataset and fit two lines to it (from Modern Elementary Statistics, 9th ed., Freund and Simon, Prentice Hall, 1997):

    x   y
    4   6
    9   10
    1   2
    6   2

First let us plot a horizontal line (y = 5). This gives a predicted value of 5 for all values of y, regardless of the input value of x. To differentiate the real values of y from the predicted values, we use the notation y-hat (ŷ) for the predicted values. Given this, we see there are errors in the prediction. We can measure the error by subtracting the predicted value from the actual value, like so:

    x   y    y − ŷ
    4   6    6 − 5 = 1
    9   10   10 − 5 = 5
    1   2    2 − 5 = −3
    6   2    2 − 5 = −3

If we sum these errors we get zero, which isn't very informative (just as with the deviations used for the variance and standard deviation). So we square these values to remove the sign:

    x   y    y − ŷ         (y − ŷ)²
    4   6    6 − 5 = 1      1
    9   10   10 − 5 = 5     25
    1   2    2 − 5 = −3     9
    6   2    2 − 5 = −3     9
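The squared-error bookkeeping in these tables is easy to mechanize. Here is a minimal Python sketch, assuming the four-point dataset above; sum_squared_errors is a hypothetical helper name that accepts any candidate prediction line.

```python
# The four-point dataset from Freund and Simon, as tabulated above.
data = [(4, 6), (9, 10), (1, 2), (6, 2)]

def sum_squared_errors(predict, points):
    """Sum of (y - y_hat)^2 over all data points for a candidate line."""
    return sum((y - predict(x)) ** 2 for x, y in points)

# The horizontal line y = 5 from the table above:
print(sum_squared_errors(lambda x: 5, data))  # 44
```

Any other candidate line can be passed in the same way, which is exactly the comparison made next.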

Now the sum of our squared errors is 44, a much more meaningful number. This doesn't look very good, though, so let's try to fit a better line to the data, such as y = 1 + x. This line actually fits two of the points exactly, (1, 2) and (9, 10), which reduces our error greatly. Following the same procedure as before (subtracting the predicted value of y from the actual value, then squaring the result), we get the following:

                 y = 5                     y = 1 + x
    x   y    y − ŷ        (y − ŷ)²     y − ŷ          (y − ŷ)²
    4   6    6 − 5 = 1     1           6 − 5 = 1       1
    9   10   10 − 5 = 5    25          10 − 10 = 0     0
    1   2    2 − 5 = −3    9           2 − 2 = 0       0
    6   2    2 − 5 = −3    9           2 − 7 = −5      25

The sum of our squared errors is now 26, which is much better than the 44 we got with the horizontal line. Comparing the two lines, we would conclude that the second line is the better fit. The line that makes this sum of squared errors as small as possible, over all possible lines, is called the least-squares line, and the procedure we use to find it is called the method of least squares. We want the line that minimizes the errors between the actual values of y and the predicted values ŷ. We accept that there will be some error, since only a perfect linear relationship puts all the data points exactly on a line, and that would be rare (or manufactured).

Calculating the Regression Line

The equation for a regression line is the same line equation we learned before, only with slightly different symbols:

$$ \hat{y} = a + bx $$

We compute the value of b first, since we use it to calculate a. The formula for b is

$$ b = \frac{S_{xy}}{S_{xx}} $$

where S_xy is the sum of the cross products of the x and y deviations and S_xx is the sum of the squared deviations of the x values. These are computed by the following formulas:

$$ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \qquad\qquad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} $$

If we have the correlation coefficient r, we can instead use it with the standard deviations of x and y to calculate b:

$$ b = r \, \frac{s_y}{s_x} $$

To compute a we use a relatively simple formula:

$$ a = \bar{y} - b\bar{x} $$
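As a check on these formulas, here is a minimal sketch applying them to the four-point example dataset. The numbers are computed directly from the sums, so no library is assumed.

```python
x = [4, 9, 1, 6]
y = [6, 10, 2, 2]
n = len(x)

# S_xy and S_xx from the formulas above
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # 28.0
s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n                  # 34.0

b = s_xy / s_xx                    # slope: 28/34, about 0.824
a = sum(y) / n - b * sum(x) / n    # intercept: 5 - b*5, about 0.882

print(f"y-hat = {a:.3f} + {b:.3f}x")  # y-hat = 0.882 + 0.824x
```

For these data the least-squares line is therefore ŷ ≈ 0.882 + 0.824x, and its sum of squared errors works out to about 20.9, smaller than the 26 achieved by y = 1 + x. That guessed line was a good fit, but not the minimizing one.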

Interpreting a and b

What are these mysterious values a and b? In the simplest terms, a is the y-intercept and b is the slope of the line. We have to be careful with the regression line, though, because the values produced are not always sensible. Yes, the line crosses the y-axis at a when we plug in zero for x, but the value at zero may not make sense. For example, a regression line that predicts food expenditure (in hundreds of dollars) based upon income has the equation

$$ \hat{y} = 1.1414 + 0.2642x $$

so when we plug in zero for x we get 1.1414, which would mean a family spends $114.14 on food when it has zero income. This illustrates a problem known as extrapolation. The regression line is calculated from a given data set, and the line is valid for all inputs within the limits (minimum and maximum) of that data set. Outside those limits the regression line may not be valid. There is a demonstrated relationship between age and height in children, and height can reasonably be estimated from age with a sufficiently large data set, but growth eventually stops while age continues. Predicting height for someone at age 21 would probably yield a meaningless value.

The value of b gives the change in y due to a change of one unit in x (Mann, p. 637).
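To see the extrapolation danger numerically, here is a minimal sketch using the food-expenditure equation above. The sample income value of 35, and the assumption that income x is also measured in hundreds of dollars and lies within the original data range, are hypothetical: the notes do not state the units of x or the range of the data.

```python
def predicted_food_expenditure(income):
    """y-hat = 1.1414 + 0.2642x, the regression equation above.
    Output is in hundreds of dollars; income units are assumed."""
    return 1.1414 + 0.2642 * income

# Inside the (assumed) range of the original data: a plausible prediction.
print(predicted_food_expenditure(35))  # about 10.39, i.e. roughly $1,039

# Extrapolating to x = 0, outside the data: the intercept alone claims
# $114.14 of food spending with no income, which is not meaningful.
print(predicted_food_expenditure(0))   # 1.1414, i.e. $114.14
```

The arithmetic works the same either way; it is the interpretation of the output that breaks down once x leaves the range of the data used to fit the line.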