II. Descriptive Statistics D. Linear Correlation and Regression

In this section
Linear Correlation
Cause and Effect
Linear Regression

1. Linear Correlation

Quantifying Linear Correlation

The Pearson product-moment correlation coefficient, denoted as r, describes a linear relationship between two quantitative variables. It is important to notice that this value indicates only the linear relationship; another kind of relationship could be present in the data. The r value indicates both the strength of the linear association and its direction. You will not need to calculate an r value by hand, but in case you are interested the formula is below:

\[ r = \frac{SXY}{\sqrt{SSX \cdot SSY}} \]

where

\[ SSX = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} \]
\[ SSY = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} \]
\[ SXY = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} \]

You will be required to interpret what an r value tells you. We will start with the direction of the relationship.

Positive r suggests that large values of X and Y occur together and that small values of X and Y occur together. This means that the slope of the line that best fits the points is positive. An example would be Experience and Salary. People with lower levels of experience tend to have lower salaries and people with more experience tend to have higher salaries.

Negative r suggests that large values of one variable tend to occur with small values of the other variable. This means that the slope of the line that best fits the points is negative. An example would be Weight of a car and Gas mileage. Light cars tend to have higher gas mileage and heavier cars tend to have lower gas mileage.
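The quantities SSX, SSY and SXY in the formula for r can be computed directly. The following is a minimal sketch in Python, not part of the course material; the function name `pearson_r` and both data sets are hypothetical and were invented only to show one positive and one negative relationship.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient built from SSX, SSY and SXY."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if ssx == 0 or ssy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(ssx * ssy)


# Hypothetical data, invented for illustration only.
experience_years = [1, 3, 5, 7, 9]
salary_thousands = [40, 48, 55, 67, 70]
car_weight = [2.0, 2.5, 3.0, 3.5, 4.0]
gas_mileage = [44, 40, 36, 30, 27]

r_salary = pearson_r(experience_years, salary_thousands)
r_mileage = pearson_r(car_weight, gas_mileage)
print(f"experience vs salary: r = {r_salary:.3f}")
print(f"weight vs mileage:    r = {r_mileage:.3f}")
```

The sign of each result matches the direction described above: positive for experience and salary, negative for weight and mileage.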

So the sign of the r value tells us the direction of the relationship. The strength of the relationship is measured by the actual value of the number. By the term strength, we mean how close the points are to a line. The closer the points are to a line, the stronger the relationship. Because of the setup of r, the maximum value for r (in terms of absolute value) is 1. Below are some useful things to keep in mind.

$-1 \le r \le 1$
If $r = 1$, there is perfect positive linear correlation: all data are exactly on a line with positive slope.
If $r = -1$, there is perfect negative linear correlation: all data are exactly on a line with negative slope.
If $r = 0$, there is no linear relationship (keep in mind there could be another type of correlation).
The stronger the linear relationship, the larger $|r|$ (the closer to 1 this value will be).
Generally, we will say there is a strong relationship if $|r| > .75$.

Looking back to our example from the last section, when introducing scatterplots we had X = Dosage of Drug and Y = Reduction in Blood Pressure. What do you think the r value will be? Remember, based on the scatterplot, that the points had a strong positive linear relationship: not perfect, but pretty close, meaning the r value should be close to 1. If you calculate this value, you will get r = .9978. This should seem reasonable, as it supports what we identified in the graph.

Another measure you will sometimes see reported is the R-squared value. It is common for computer software to give you an R-squared value instead of r. This value represents the percent of variation in Y explained by the model. It measures the strength of the relationship and, in the linear case, is simply calculated by squaring the r value. The higher R-squared is, the better the model.

$0\% \le R^2 \le 100\%$

For the Drug example:

\[ r = .9978 \qquad R^2 = (.9978)^2 = .995 = 99.5\% \]
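The interpretation rules above (sign gives direction, absolute value gives strength, squaring gives R-squared) can be collected into a small helper. This is only a sketch: the function name `describe_r` is hypothetical, and treating the .75 guideline as a strict "greater than" cutoff is an assumption.

```python
def describe_r(r, strong_cutoff=0.75):
    """Return (direction, strength, R-squared) for a correlation coefficient.

    strong_cutoff is the .75 guideline; using a strict '>' is an assumption.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    strength = "strong" if abs(r) > strong_cutoff else "not strong"
    return direction, strength, r * r


print(describe_r(0.9978))  # the r reported for the Drug example
print(describe_r(-0.5))
```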

2. Cause and Effect

Causal Research: When the objective is to determine if a variable causes a certain behavior (whether there is a cause and effect relationship between variables).

Note: it is never possible to prove causality just based on the relationship between two variables.

There is a strong statistical correlation over the months of the year between ice cream consumption and the number of assaults in the U.S. The r value for this data is above .9. Does this mean ice cream manufacturers are responsible for crime? No! The correlation occurs statistically because the hot temperatures of summer increase both ice cream consumption and assaults (high values occur at the same time and low values occur at the same time).

Thus, correlation does NOT imply causation. This is one of the biggest mistakes that I see in the interpretation of a correlation. You should always keep in mind that factors other than cause and effect can create an observed correlation.

To establish whether two variables are causally related you must establish all of the following:

1) Time order - the cause must have occurred before the effect
2) Co-variation (statistical association) - the correlation coefficient and graph must show a strong relationship between the dependent and independent variable
3) Rationale - there must be a logical and compelling explanation for why one variable causes the other
4) Non-spuriousness - it must be established that the independent variable X, and only X, was the cause of changes in the dependent variable Y; rival explanations must be ruled out

The first three of these can be easily established in many cases. It is the fourth criterion which is hard and can rarely be shown. To help identify a relationship as cause and effect, a study should be performed many times. The study should yield the same results every time it is conducted. Given that the outside variables will differ from situation to situation, this helps rule out rival explanations.

Causal research is very complex and the researcher can rarely be certain that other factors are not influencing a relationship.
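The ice cream and assaults example can be imitated with made-up numbers. In the sketch below, all values are hypothetical (they are not the real U.S. data); each series is built from temperature plus its own small wiggle, and neither series is computed from the other. The helper `pearson_r` is also an illustrative name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(ssx * ssy)


# Hypothetical monthly values, invented for illustration (not real data).
temperature = [30, 35, 45, 55, 65, 75, 85, 83, 72, 60, 45, 33]
ice_wiggle = [3, -3] * 6             # part of ice cream sales unrelated to heat
assault_wiggle = [4, 4, -4, -4] * 3  # part of assaults unrelated to heat

# Both series depend on temperature; neither depends on the other.
ice_cream = [2 * t + w for t, w in zip(temperature, ice_wiggle)]
assaults = [5 * t + w for t, w in zip(temperature, assault_wiggle)]

r_observed = pearson_r(ice_cream, assaults)
r_wiggles = pearson_r(ice_wiggle, assault_wiggle)
print(f"ice cream vs assaults:       r = {r_observed:.3f}")
print(f"parts not driven by weather: r = {r_wiggles:.3f}")
```

The two series are strongly correlated even though neither was built from the other, while the parts that do not come from temperature show no linear relationship. This is the kind of rival explanation that the non-spuriousness criterion asks you to rule out.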

3. Linear Regression

Deterministic View: This is the idea that Y is caused by X, or that once X has happened, Y will follow. In this situation, the exact value of Y is known. The deterministic view is studied in a typical algebra class. However, a deterministic view is not possible when applied to the behavior of many variables.

Regression: A technique used to predict variables (typically difficult-to-measure variables) based on a set of other variables (typically easier-to-measure variables).

Linear Regression: Used to predict the value of Y (the response variable) based on X (the explanatory variable) using a linear equation.

Predict reaction time based on blood alcohol level. Reaction time is difficult to measure, so instead we predict it with blood alcohol level, which is easy to measure.

The linear regression model expresses Y as a function of X plus random error. Random error reflects variation in Y values. Keep in mind we are going to measure X, so assuming we get a good measure there is no error in the X variable. However, when we go to use X to predict Y, the prediction will not be exact. Therefore, there is error in the Y variable. Graphically this error is represented by the vertical distance between the points and the line.

The linear regression model is:

\[ Y = b_0 + b_1 x \]

where
$b_0$ is the y-intercept
$b_1$ is the slope

The above formula has the same format as what you should be used to from an algebra class. However, the way we denote the relationship is different. It is important that you become familiar with this notation.

In order to use linear regression, we must first make sure the model is reasonable. The scatter plot and r should indicate a strong relationship. If the model is not reasonable, do not fit a line. It may still be possible to do regression with a more complicated model. However, if there is no relationship between the variables, then regression cannot be used. In this class we will not worry about more complicated models, but you should understand that a simple linear model is just one of the many options available.

When using a linear regression model, we need the line that is the best fit for our data. Since our purpose will be to predict, we will want to pick the line that will minimize the error in the prediction. To accomplish this we will use the method of least-squares.

Method of Least-Squares: chooses the line for which the sum of the squares of the vertical distances from the points to the line is minimized. Remember it is the vertical distance that represents the error.

To calculate the best fit line you can use the following formulas. You do not have to do this by hand in this class. I show you the formulas in case you are interested.

\[ b_1 = \frac{SXY}{SSX} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]

At the beginning of this section, when looking at correlation for the Dosage of drug and Reduction in blood pressure example, we identified r = .9978, which indicates a high positive linear correlation. This fact along with the scatterplot supports the use of linear regression in this case. With the above formulas, you can calculate $b_1 = .118$ and $b_0 = -3.4$. Therefore, the regression model in this case is:

\[ \hat{y} = b_0 + b_1 x \]
\[ \hat{y} = -3.4 + .118x \]

As I stated earlier, you will not have to calculate the formula by hand. Instead, I will provide computer output and you need to be able to answer questions based on the output. The computer output (a regression plot) for this example follows.
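The least-squares formulas for the slope and intercept can be tried out on a small data set. The sketch below uses hypothetical dosage and reduction values (not the course's Drug data), and the names `fit_line` and `sum_squared_errors` are illustrative. It also checks numerically that nudging the fitted line makes the sum of squared vertical distances larger.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 from SXY and SSX."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if ssx == 0:
        raise ValueError("slope is undefined when every x is the same")
    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only.
dosage = [1, 2, 3, 4, 5]
reduction = [1.2, 1.9, 3.2, 3.8, 5.1]

b0, b1 = fit_line(dosage, reduction)
best = sum_squared_errors(dosage, reduction, b0, b1)
steeper = sum_squared_errors(dosage, reduction, b0, b1 + 0.1)
shifted = sum_squared_errors(dosage, reduction, b0 + 0.1, b1)
print(f"fitted line: y = {b0:.2f} + {b1:.2f}x")
print(f"SSE fitted = {best:.3f}, steeper = {steeper:.3f}, shifted = {shifted:.3f}")
```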

[Regression plot: Pressure versus Drug. Y = -3.4 + .118X, R-Sq = 99.5%]

In this output, the equation and the R-squared value are given. If you look above the graph, you will see this information. Notice that the R-squared value for this example is exactly what we stated previously in this section of material. You need to be able to get the r value based on the R-squared that is given in the output. All you have to do is take the square root of the R-squared value. The thing you have to be careful of is the direction of the relationship. Remember that if the slope of the line is positive then r is positive, and if the slope is negative then r is negative. Therefore, you must look at the slope in order to decide if r is positive or negative.

In terms of the equation, you need to be able to use it for prediction. This is a pretty direct process, as we will always be predicting Y based on X. Therefore, you will plug in for X and solve for Y. For our example, predict the Reduction in Blood Pressure if 5 is the Dosage of Drug.

\[ \hat{y} = -3.4 + .118x \]
\[ \hat{y} = -3.4 + .118(5) \]
\[ \hat{y} = 6.1 \]
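Both steps described above (recovering r from R-squared using the sign of the slope, and plugging X into the equation) are easy to script. The following is a sketch with hypothetical output values; the function names and the numbers 81.0, -2.0, 1.0, 2.0 and 5 are made up for illustration and do not come from the course output.

```python
from math import sqrt


def r_from_r_squared(r_squared_percent, slope):
    """Recover r from R-squared (in percent); r takes the sign of the slope."""
    if not 0 <= r_squared_percent <= 100:
        raise ValueError("R-squared must be between 0% and 100%")
    magnitude = sqrt(r_squared_percent / 100)
    return -magnitude if slope < 0 else magnitude


def predict(b0, b1, x):
    """Plug x into the fitted equation y = b0 + b1 * x."""
    return b0 + b1 * x


# Hypothetical software output: R-Sq = 81.0% with a negative slope.
print(r_from_r_squared(81.0, slope=-2.0))
# Hypothetical fitted equation y = 1.0 + 2.0x, evaluated at x = 5.
print(predict(1.0, 2.0, 5))
```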

Consider the following data and software output, which give the weight (in thousands of pounds), X, and gasoline mileage (miles per gallon), Y, for ten different automobiles.

X   .5   3.   4.   3.5   .7   4.5   3.8   .9   5.   .
Y   4    43   3    35    4    19    3     39   15   44

[Regression plot: Y versus X. Y = 7.158 - 1.6175 X, S = 3.46, R-Sq = 92.0%, R-Sq(adj) = 91.0%]

1. Calculate r.
2. Based on r and the scatterplot, is linear regression justified in this case?
3. Predict the gas mileage for a car weighing 4. (4,) pounds.

Answers

1. $r = -\sqrt{R^2} = -\sqrt{.92} = -.959$. This is negative because of the negative slope.
2. Yes, linear regression is justified: $|r| > .75$ and the points are spread reasonably about the line on the scatterplot.
3. Y = 7.158 - 1.6175X = 7.158 - 1.6175(4.) = 5. 51

Cautions with regression

There are two common mistakes with regression. You must be aware of the problems with extrapolation and extreme values.

Interpolation: predicting Y values for X values that are within the range of the scatter plot (this is what regression should be used for).

Extrapolation: predicting Y values for X values beyond the range of the observations (this should not be done with a basic regression model; it is a complex problem).

If our X variable ranges from 1 to 5, as it does in the Dosage of drug and Reduction in blood pressure example, then it is reasonable to predict within that range. However, if you try to predict for an X of 10, then you have no data indicating that this relationship holds at that value. It is quite possible that the relationship changes beyond the range of the data. There is no way to know this without collecting data consistent with the X values you want to predict.

The least-squares line and the r value can be affected greatly by extreme data points. In order to illustrate this we will look at some computer output.

[Regression plot: C1 versus C2. Y = 4.4338 + 1.61E-X, R-Sq = 0.0%]

Calculate the r value for the above data.

r = 0

With an r of 0, we know that there is no linear relationship between X and Y.
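The extrapolation caution above can be enforced mechanically by refusing to predict outside the observed X range. This is a minimal sketch; the function name `predict_in_range`, the observed X values, and the coefficients 0.5 and 1.5 are all hypothetical.

```python
def predict_in_range(x_new, observed_xs, b0, b1):
    """Predict y = b0 + b1 * x only when x_new lies inside the observed range."""
    low, high = min(observed_xs), max(observed_xs)
    if not low <= x_new <= high:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{low}, {high}]; "
            "refusing to extrapolate"
        )
    return b0 + b1 * x_new


# Hypothetical observed X values and fitted coefficients.
observed = [1, 2, 3, 4, 5]
print(predict_in_range(3.5, observed, b0=0.5, b1=1.5))  # interpolation

try:
    predict_in_range(10, observed, b0=0.5, b1=1.5)  # extrapolation
except ValueError as err:
    print(err)
```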

[Regression plot: C1 versus C2. Y = 3.99359 + .869956X, R-Sq = 77.9%]

Calculate r for the above data.

\[ r = \sqrt{.779} = .883 \]

With an r of .883, which is bigger than the criterion of .75, it seems like we have a strong relationship. With further investigation via the scatterplot, you will see that all of the data is in the bottom left of the graph except one data point, which is extreme. What I actually did was take the data from the previous graph, with an r value of 0, and add one extreme value. Notice that the extreme value makes the other data points appear close together. They also appear numerically close, since the one value is so extreme. Therefore, the r value is high because the points are close to the line. In this case linear regression is not justified. If you have an extreme value in a plot, as in this case, you should remove the extreme value and see if the relationship still exists. In this case it does not, so linear regression will not work for this data.
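The effect of a single extreme value can be reproduced with a tiny made-up data set. In the sketch below, the five base points and the extreme point (30, 30) are hypothetical and are not the data behind the plots above; `pearson_r` is an illustrative helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(ssx * ssy)


# Hypothetical points with no linear relationship.
x_base = [1, 2, 3, 4, 5]
y_base = [2, 4, 3, 4, 2]

# The same points plus one hypothetical extreme value far from the rest.
x_extreme = x_base + [30]
y_extreme = y_base + [30]

r_without = pearson_r(x_base, y_base)
r_with = pearson_r(x_extreme, y_extreme)
print(f"without the extreme value: r = {r_without:.3f}")
print(f"with the extreme value:    r = {r_with:.3f}")
```

Removing the extreme point and recomputing r, as recommended above, shows that the apparent strong relationship came from that one point alone.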