Chapter 10 Regression Analysis


Goal: To become familiar with how to use Excel 2007/2010 for correlation and regression.

Instructions: You will be using CORREL, FORECAST and Regression. CORREL and FORECAST are found on the Stat Menu, and Regression is found in the Data Analysis group.

CORREL

Select CORREL from the Stat Menu. Typically, the data is organized so that Array1 is in Column 1 and Array2 is in Column 2. After you have selected the data, the tool returns the correlation coefficient. This is a number between -1 and +1 and is explained further in the notes below.

FORECAST

Select FORECAST from the Stat Menu.
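CORREL computes the Pearson correlation coefficient. As a cross-check outside Excel, the same number can be computed directly; here is a minimal sketch in Python (the height/weight figures are made-up illustration data, and the function name `correl` is mine):

```python
def correl(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL returns."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical height (inches) / weight (lbs) data for illustration:
heights = [60, 64, 68, 72, 76]
weights = [120, 140, 155, 175, 190]
r = correl(heights, weights)  # close to +1: strong positive linear relationship
```

The result always lands between -1 and +1, matching the description above.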

REGRESSION

Select Regression from the Data Analysis menu. You just have to be a little careful when you enter the data: take note that the dialog asks for the Y data first. Data is entered in the typical way, by selecting the desired data. If you have labels at the top of your data columns and want the labels to carry over, check the Labels box. As before, select Output Range and then select some cell on the worksheet. Lastly, check Line Fit Plots so you get a chart showing the data and the regression line.

Be careful how you enter data into the tool. Pay close attention to which column of data you select for your Known_y's and for your Known_x's; don't mix them up. In our example, the weight data would be the Known_y's. Once you've entered the input data, you can make predictions about various males by entering various values for their height into the input field labeled X. The tool returns the predicted value, in this case, for weight. Here is a typical output with the important fields highlighted:

Multiple R: the correlation coefficient.
R Square: the coefficient of determination.
Standard Error: the standard error of the estimate.
Significance F: plays exactly the same role as the p-value did for hypothesis testing. If this value is less than alpha, then the regression is significant.
Intercept: the y-intercept of the linear regression equation.
Slope: the slope of the linear regression equation.

The following is the chart output. The markers in blue are the actual data values and the red line is the linear regression line.
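The highlighted fields can be reproduced by hand from the ordinary least squares formulas. A sketch (function and variable names are mine, not Excel's):

```python
def ols_summary(xs, ys):
    """Slope, intercept, R Square, and standard error s_e for the line y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx                       # Slope
    b0 = my - b1 * mx                    # Intercept
    preds = [b0 + b1 * x for x in xs]
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # sum of squared residuals
    sst = sum((y - my) ** 2 for y in ys)
    r2 = 1 - sse / sst                   # R Square
    se = (sse / (n - 2)) ** 0.5          # Standard Error of the estimate
    return b0, b1, r2, se

# On data lying exactly on y = 2x + 1, the fit is perfect:
b0, b1, r2, se = ols_summary([1, 2, 3, 4], [3, 5, 7, 9])
```

With real, scattered data, R Square drops below 1 and the standard error becomes positive.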

Regression Analysis involves the study of ordered pairs of data, such as (X, Y). If a strong linear relationship exists between X and Y, then given a value of X, we can make a prediction about what Y should be.

Consider height and weight. There is a strong correlation between the two as well as a significant linear relationship, i.e. we can express weight as a linear function of height. Let's say that the average weight of all males in the United States is 170 lbs. If we were to pick some male at random, without seeing the person, the best guess of this person's weight would be 170 lbs, the population mean. However, if we knew of a linear relationship between height and weight, and we knew the person's height, we could make a better guess of the selected male's weight. Let's say that weight is related to height by a linear equation of the form W = b₀ + b₁H, where W is in lbs and H is in inches, and that we know the selected male is 72" tall. Plugging 72 into the equation, we would estimate the person's weight to be 188.5 lbs, which would be a better guess than 170, the average of the whole male population. Knowing the linear relationship that exists between weight and height enables us to make better predictions than just knowing the population mean.

Correlation

A correlation exists between two variables when there exists a relationship between the two. In other words, one can be used to predict the value of the other. In this class we will study those correlations where the relationship is a linear one, i.e. one variable can be expressed as a linear function of the other. The following formula shows a linear relationship between y and x:

y = β₀ + β₁x

β₀ and β₁ are constant numbers, for example, -2 and 10:

y = -2 + 10x

In this example, if x were to equal 4, then y would equal y = -2 + 10(4) = 38.

Regression is somewhat similar to point estimation because β₀ and β₁ are population parameters, and we calculate b₀ and b₁ from our regression equations, which are estimates of β₀ and β₁. Let's take a look at an example.
The following table shows the costs of subway fare and a slice of pizza in New York City from 1960 through 2004:

Year           1960  1973  1986  1992  2000  2004
Cost of Pizza  0.15  0.35  1.00  1.35  1.75  2.00
Subway Fare    0.15  0.35  1.00  1.25  1.50  2.00

It certainly looks like there exists a linear relationship between the two sets of data. We can measure the strength of this relationship using the correlation coefficient r. The Excel tool we use is CORREL. For the data in this example, CORREL returns a value of 0.998. Use a table of critical r values, indexed by sample size, to determine whether the correlation is statistically significant.

Correlation can be positive, negative or near zero. The figures show the relationship between the sign of the correlation coefficient and the arrangement of the data; observe how the data is randomly distributed about the mean when r is 0. A correlation coefficient of 0.998 is clearly significant.

Regression

Now that we have seen that the cost of pizza is highly correlated with the cost of subway fare, we can use one to predict the other. The Excel tool that we use for this is FORECAST. For example, if we input the value of the cost of pizza, such as 1.00, the tool returns the predicted value 0.98. Notice that this does not equal the 1.00 in the table above for the cost of subway fare. Remember, FORECAST uses the data in the table to make estimates of β₀ and β₁, and it uses these estimates to calculate predicted values. There's no reason to believe that a predicted value would equal an actual value in a sample, but it is close.

The idea behind Linear Regression Analysis is to find a straight line that best fits the data. Look at the chart below. For a given value of x, there are two values of y: the data point y, and the predicted value ŷ lying on the red line. The difference between y and ŷ is something we would like to minimize by moving the red line around. We can change the slope and the intercept of the line, but we can't change the location of the actual data values. I hope you can see that by moving the red line around, we can minimize the difference for some values but at the expense of others.
However, there is one line that minimizes the differences better than all other lines, and that line is the regression line. Therefore, what we mean by best fit is finding the statistics b₀ and b₁ such that the line ŷ = b₀ + b₁x minimizes the sum of the squares of all those differences.
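This best-fit claim can be checked numerically: perturbing the fitted slope or intercept can only increase the sum of squared differences. A small sketch, using made-up illustration data:

```python
def sse(xs, ys, b0, b1):
    """Sum of squared differences between the data and the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def fit(xs, ys):
    """Ordinary least squares estimates b0, b1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x, with some scatter
b0, b1 = fit(xs, ys)
best = sse(xs, ys, b0, b1)

# Moving the red line around (nudging slope or intercept) never beats the fit:
assert all(sse(xs, ys, b0 + d0, b1 + d1) >= best
           for d0 in (-0.5, 0, 0.5) for d1 in (-0.5, 0, 0.5))
```

Any other line you try leaves a strictly larger sum of squared differences; equality holds only for the regression line itself.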

The data is from a random sample, and so one would expect it to be scattered around a bit, but see how the straight line does a pretty good job of fitting the data.

Regression Analysis lets us make better predictions of our data if we know something about the data. For example, let's suppose that during the work week, it takes you an average of 5.6 minutes to get out of bed. If I wanted to predict how long it would take you to get out of bed on Thursday, the best I could do would be to say 5.6 minutes. Now, suppose that there exists a linear relationship between how long it takes you to get out of bed and the day of the work week. Consider the following table:

Day of Week   1   2   3   4   5
Time (mins)  10   8   5   3   2

The correlation coefficient is -0.988. Now, let's see what we would predict for Thursday. FORECAST returns 3.5 minutes, which is a much better estimate than 5.6 minutes. Therefore, you can see that if the data has a linear relationship associated with it, we can use that relationship to make better predictions than just using the sample data.

Regression is also useful for making predictions slightly into the future. For example, looking back at the pizza/subway example, if in the year 2010 we knew that pizza would cost $2.25, we could predict that subway fare would be $2.1138 or, more likely, $2.10.

Confidence Intervals

ŷ is an estimate, and therefore there is some uncertainty associated with it. This situation is identical to when we were using samples to estimate the mean of a population: x̄ was the best estimate of μ, but we understood that we needed an interval, called a confidence interval, in which we could have, say, 95% confidence that it provided a lower and upper bound for the possible values of μ. We do exactly the same thing here, except the formula for calculating E, the margin of error, is different. Previously, it was understood that the population was normally distributed, and that is also a necessary condition here.
Both the X and Y data must be normally distributed in order to proceed with confidence intervals.
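Excel's FORECAST(x, known_y's, known_x's) fits this least-squares line internally. Its arithmetic on the get-out-of-bed data above can be replicated directly; a sketch in Python (the function name `forecast` is mine):

```python
def forecast(x_new, ys, xs):
    """Predicted y at x_new from the least-squares line,
    mirroring Excel's FORECAST(x, known_ys, known_xs) argument order."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0 + b1 * x_new

days = [1, 2, 3, 4, 5]
times = [10, 8, 5, 3, 2]          # minutes to get out of bed
pred = forecast(4, times, days)   # Thursday: 3.5 minutes, as in the text
```

Note that predicting at the mean day (x = 3) returns exactly the sample mean of 5.6 minutes; the regression only improves on the mean when you move away from it.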

The formula for calculating E, the margin of error, is

E = t(α/2) · s_e · sqrt( 1 + 1/n + n(x₀ - x̄)² / (n(Σx²) - (Σx)²) )

where

t(α/2) is the two-tailed critical value for significance that we used before for confidence intervals,
s_e is the standard error that we get when we run the Regression Analysis (see the printout above),
x̄ is the mean of the X data,
n is the sample size, and
Σx² and (Σx)² are the sum of the squares of the X data and the square of the sum of the X data, respectively.

Given a value for x, say x₀, we can calculate the predicted value of y, ŷ. If we were to measure everyone in the population and ran a regression on all that data, then we would get β₀ and β₁, the regression parameters for the entire population. b₀ and b₁ are estimates of β₀ and β₁, respectively, and so ŷ would be an estimate of y, just like x̄ was an estimate for μ. And just like we needed an interval in which we could have, say, 95% confidence that it provided an upper and lower bound for μ, we need just such an interval for y. That interval is given by

ŷ - E < y < ŷ + E

One of the things we should notice about E is that it is a function of x₀. The further x₀ is from x̄, the greater (x₀ - x̄)² will be, and hence the greater E will be. In other words, the confidence interval is not constant for all x₀ but becomes wider as x₀ moves away from x̄.

Important things to take note of

1. Correlation does not imply causality. Just because A and B are highly correlated does not mean that A is the cause of B. Pizza slice costs and subway fares are highly correlated, but would anyone infer that rising pizza costs were the cause of rising subway fares? Now apply this argument to global warming and the increasing levels of greenhouse gases in the atmosphere. Is the latter the cause of the former?

2. There is a limit to how far into the future we can reliably make predictions using the regression equations. We cannot forecast the global temperature thirty years from now any more than we can precisely forecast the weather for any particular day one year from now.

3. Be careful about outliers.
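The margin-of-error formula above can be sketched in Python as well (same symbols; the t critical value is passed in rather than looked up, to keep the sketch self-contained):

```python
def margin_of_error(x0, xs, se, t_crit):
    """E = t(a/2) * s_e * sqrt(1 + 1/n + n*(x0 - xbar)^2 / (n*sum(x^2) - (sum x)^2))."""
    n = len(xs)
    xbar = sum(xs) / n
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return t_crit * se * (1 + 1 / n
                          + n * (x0 - xbar) ** 2
                          / (n * sum_x2 - sum_x ** 2)) ** 0.5

# The interval ŷ - E < y < ŷ + E widens as x0 moves away from x̄.
# Here s_e = 0.5 is a made-up standard error; t_crit = 3.182 is the
# two-tailed 95% critical value for n - 2 = 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
near = margin_of_error(3.0, xs, se=0.5, t_crit=3.182)    # x0 at the mean x̄
far = margin_of_error(10.0, xs, se=0.5, t_crit=3.182)    # x0 far from x̄
```

Comparing `near` and `far` shows the behavior the text describes: E is smallest at x̄ and grows as x₀ moves away from it.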
You can't use them because they will severely throw off the results, but you can't ignore them either. Document the fact that you're not using them and give your best reason for doing so.