Module 1 Linear Regression


Regression Analysis

Although many phenomena can be modeled with well-defined and simply stated mathematical functions, as illustrated by our study of linear, exponential, and quadratic functions, the world about us is not as well-behaved as we might want it to be. Everything in the universe is subject to statistical variation, random forces, and multiple influences, some of which are known and controllable, and some of which are not. For example, we can use a quadratic function to model the path of a falling object when we assume that gravity is the only force to be considered. But if wind resistance were an issue, our mathematical model would have to be more complex. The aerodynamic design of airplanes and automobiles, and the science of weather forecasting, are examples where scientists need elaborate mathematical models.

The scientific method of investigation is to gather data and to analyze that data with the aid of statistics and mathematical modeling. From this analysis, patterns emerge and we become ever more adept at understanding the phenomenon being studied. We will proceed to learn about a fundamental statistical and mathematical modeling technique that allows data to be analyzed in a controlled manner despite the uncertainty, and in some cases the chaos, present in the world about us. This data analysis technique is known as regression analysis. Regression analysis can be applied with a variety of mathematical functions, including linear, exponential, and quadratic functions. We will begin our study of regression analysis with an example of linear regression.

Example of Linear Regression

Suppose that eight college students who have completed two years of study at a particular higher education institution are selected at random. Assume that their respective scores on the American College Testing (ACT) college entrance exam prior to college admission, and their respective Grade Point Averages (GPA) after two years of college study, were as follows:

Student   ACT Score (X)   GPA (Y)
   1          17            2.1
   2          20            2.6
   3          22            2.4
   4          25            2.7
   5          26            2.6
   6          29            2.8
   7          31            2.9
   8          32            3.2

A TI-83 or TI-84 graphing calculator should be used to facilitate the analysis of this data. The ACT score data can be entered into list L1, and the GPA data into list L2, using the STAT EDIT function of the calculator. Then a scatter plot diagram of this data can be set up in the calculator using the 2ND STATPLOT function. The resulting scatter plot diagram can be viewed by using the ZOOM 9 function of the calculator.

Scatter Plot

If the x-axis is used for ACT scores and the y-axis is used for GPA data, the eight points from the table above result in the following scatter plot diagram:

[Scatter plot: ACT Test Score (x-axis, 15 to 35) versus GPA (y-axis, 2.00 to 3.20)]

This depiction suggests that there may be an association, that is, some relationship, between a student's ACT score at the time of admission to college and that student's GPA after two years of college study: a higher ACT score tends to be associated with a higher GPA. This is not a matter of asserting that there is a direct causal relationship between the ACT score and the respective GPA. There are many factors which influence and determine a student's GPA, including essential matters of self-determination and also unforeseen happenings in life beyond one's control. However, the data in the table above can be thought of as a statistical sample which can be used as the basis for inferring something about the overall population of students from which the sample is drawn, and for making subsequent predictions about that student population with some measured level of confidence.

Regression Line

In this particular example, the scatter plot suggests that it may be possible for a straight line to be used to model the unknown relationship between ACT score and GPA. The resulting line is known in mathematical theory as a regression line:

y = aX + b

where the coefficients a and b are chosen in a way that makes this linear equation the line-of-best-fit for the given data points. From the STAT CALC menu, we use the LinReg(aX+b) function of our calculator to determine the regression line, and we obtain the following results:

LinReg
  y = aX + b
  a = 0.0565162907
  b = 1.235463659

If we truncate the values of the two coefficients, the linear equation of the regression line in this example can be stated as follows:

y = 0.056X + 1.235

This is the equation of the straight line which best serves as a model of the relationship between ACT scores and GPAs in the given student population, based on the random sample of eight students. This equation can be entered into the calculator via the Y= key and can be viewed in the display screen containing the scatter plot diagram by using the GRAPH key. The following is a depiction of the graph of this regression line superimposed on the scatter plot diagram. Note that some of the eight data points lie above the regression line and some lie below it.
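
The calculator's LinReg(aX+b) result can be checked by hand. The following Python sketch, which is not part of the original lesson, applies the standard least-squares formulas to the same eight data points and reproduces the two coefficients:

```python
# Least-squares fit y = a*x + b for the ACT/GPA sample,
# a sketch of what the calculator's LinReg(aX+b) computes.
act = [17, 20, 22, 25, 26, 29, 31, 32]
gpa = [2.1, 2.6, 2.4, 2.7, 2.6, 2.8, 2.9, 3.2]

n = len(act)
sum_x = sum(act)
sum_y = sum(gpa)
sum_xy = sum(x * y for x, y in zip(act, gpa))
sum_x2 = sum(x * x for x in act)

# Standard least-squares formulas for slope and intercept.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

print(a, b)  # approximately 0.0565162907 and 1.235463659
```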

[Scatter plot with regression line: ACT Test Score (x-axis, 15 to 35) versus GPA (y-axis, 2.00 to 3.20)]

Prediction

Assuming for now that the sample data we have is sufficiently correlated, the regression line can be used to predict the future GPA of an incoming student. Suppose that a new student being admitted to college has an ACT score of x = 23. It can reasonably be predicted that this student would have the following GPA after two years of college:

y = 0.056(23) + 1.235 = 2.523

Because the regression line was derived from a random sample of eight students, it is a statistical estimate. This implies that there is some uncertainty in this prediction. Before we can investigate and measure the uncertainty in this prediction, we need to examine the concept of linear correlation.
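
The prediction step is a one-line computation. This small Python sketch, added here for illustration, uses the truncated coefficients from the lesson:

```python
# Predict GPA for a new student with ACT score 23,
# using the truncated regression coefficients y = 0.056x + 1.235.
a, b = 0.056, 1.235
x = 23
predicted_gpa = a * x + b
print(round(predicted_gpa, 3))  # 2.523
```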

Linear Correlation

The concept of linear correlation is closely tied to the concept of linear regression. Linear correlation is a measure of the extent to which there is or is not a meaningful linear relationship in the data, that is, how well the regression line represents the data points in a scatter plot diagram. If the data points are too widely scattered, there is no linear relationship present and there is no reasonable basis for attempting to formulate a regression line. If the data points are closely aligned in a linear manner, the regression line becomes a meaningful and useful representation of that relationship.

Our calculator is capable of producing a measure of linear correlation. Before executing the LinReg(aX+b) function in the STAT CALC menu, we should do the following if we wish to obtain this measure: press 2ND CATALOG, cursor down to DiagnosticOn, and hit ENTER. Hit ENTER again. This enables additional information when using the STAT CALC menu to execute the LinReg(aX+b) function:

LinReg
  y = aX + b
  a = 0.0565162907
  b = 1.235463659
  r² = 0.839830218
  r = 0.9164225107

The number r = 0.9164225107 is an example of what is known as the linear correlation coefficient. This value implies that we have evidence of a strong linear relationship in the original data.
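
The r value the calculator reports is the Pearson correlation coefficient. As an illustration outside the original calculator workflow, the following Python sketch computes r and r² from the raw sums:

```python
import math

# Pearson correlation coefficient r for the ACT/GPA sample,
# matching the r the calculator reports with DiagnosticOn enabled.
act = [17, 20, 22, 25, 26, 29, 31, 32]
gpa = [2.1, 2.6, 2.4, 2.7, 2.6, 2.8, 2.9, 3.2]
n = len(act)

sx, sy = sum(act), sum(gpa)
sxy = sum(x * y for x, y in zip(act, gpa))
sxx = sum(x * x for x in act)
syy = sum(y * y for y in gpa)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 4))      # about 0.9164 -> strong positive correlation
print(round(r * r, 4))  # r squared, about 0.8398
```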

Positive Linear Correlation Coefficient

The linear correlation coefficient is the measure of the extent to which there is or is not a meaningful linear relationship in the data in the scatter plot diagram. If the regression line has a positive slope:

a > 0

then the linear correlation is said to be positive, and the linear correlation coefficient has a value in the following range:

0 < r ≤ 1

If the value of the linear correlation coefficient is close to 1, a strong linear relationship exists; that is, the regression line is a useful representation of a linear relationship in the original X and Y data. In this case, the points in the scatter plot diagram are aligned in a pattern around the regression line. As the points in a scatter plot diagram become more randomly or widely dispersed, the value of the linear correlation coefficient becomes closer to zero. This means that there is progressively less evidence of a linear relationship in the original data, and the regression line has little or no value.

The boundaries between what constitutes weak, moderate, or strong correlation are somewhat arbitrary. The following boundaries may be used:

0 to 0.5      Weak
0.5 to 0.8    Moderate
0.8 to 1      Strong

Negative Linear Correlation Coefficient

In a subsequent example, we will examine a case in which the data results in a regression line that has a negative slope:

a < 0

In such a case, the linear correlation is said to be negative, and the linear correlation coefficient has a value in the following range:

-1 ≤ r < 0

For negative linear correlation, the boundaries between what constitutes weak, moderate, or strong correlation are as follows:

-1 to -0.8    Strong
-0.8 to -0.5  Moderate
-0.5 to 0     Weak

Standard Error of the Estimate

The correlation coefficient allows us to assess the extent to which linear correlation is present in the original data. Mathematicians have also devised a measure of the risk and uncertainty that is present in a regression line for a given sample of data. This measure is known as the standard error of the estimate. In order to calculate the standard error of the estimate, we use the equation of the regression line that was entered into our calculator. This allows us to obtain the following table of predicted GPAs, which can be derived by using the 2ND TABLESET and 2ND TABLE functions on the calculator:

ACT Score (X)   Predicted GPA
    17              2.187
    20              2.355
    22              2.467
    25              2.635
    26              2.691
    29              2.859
    31              2.971
    32              3.027

Residuals

Now we calculate the residual values and the square of each residual:

ACT Score   Actual GPA   Predicted GPA   Residual   Square of Residual
   17          2.1           2.187        -0.087         0.00757
   20          2.6           2.355         0.245         0.06003
   22          2.4           2.467        -0.067         0.00449
   25          2.7           2.635         0.065         0.00423
   26          2.6           2.691        -0.091         0.00828
   29          2.8           2.859        -0.059         0.00348
   31          2.9           2.971        -0.071         0.00504
   32          3.2           3.027         0.173         0.02993

The residual is the difference between the actual GPA and the predicted GPA. For any given ACT score, it measures the vertical gap between the actual GPA on the scatter plot diagram and the predicted GPA on the regression line.

Next, we obtain the average value of the eight numbers in the last column of this table, that is, the sum of these numbers divided by eight:

0.12305 / 8 = 0.01538125

The standard error of the estimate is the square root of this average:

Standard Error of the Estimate = √0.01538125 ≈ 0.12

The standard error of the estimate is thus the square root of the average of the squared residuals. It is used to assess the risk and uncertainty present in the regression line. This risk and uncertainty is present because the regression line was derived from a random sample of data; the regression line is a statistical estimate, not a certain fact.
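
The residual-and-standard-error recipe above can be sketched in Python. Note that this follows the lesson's convention of averaging the squared residuals over all n = 8 points and predicting with the truncated line y = 0.056x + 1.235; many statistics texts divide by n - 2 instead, which yields a slightly larger value.

```python
import math

# Residuals and the standard error of the estimate,
# following the lesson's recipe for the ACT/GPA sample.
act = [17, 20, 22, 25, 26, 29, 31, 32]
gpa = [2.1, 2.6, 2.4, 2.7, 2.6, 2.8, 2.9, 3.2]

# Predicted GPAs from the truncated regression line.
predicted = [0.056 * x + 1.235 for x in act]

# Residual = actual GPA minus predicted GPA.
residuals = [actual - pred for actual, pred in zip(gpa, predicted)]

# Average of the squared residuals over all n points, then square root.
mean_sq = sum(res ** 2 for res in residuals) / len(residuals)
std_error = math.sqrt(mean_sq)
print(round(std_error, 2))  # 0.12
```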

Confidence Interval

The standard error of the estimate provides a way to measure the risk and uncertainty present when using the regression line to make predictions. For example, we conjectured that if a new student had an ACT score of 23, we could predict that student's GPA as follows:

y = 0.056(23) + 1.235 = 2.523

There is risk and uncertainty in this prediction. To measure it, we multiply the standard error of the estimate by 2:

2 × 0.12 = 0.24

Then we form what is known as a confidence interval around the predicted value. Using methods of statistical theory, it can be said that there is a 95% likelihood that this student's GPA will be in the following interval:

2.523 - 0.24 = 2.283   (predicted value minus two times the standard error of the estimate)
2.523 + 0.24 = 2.763   (predicted value plus two times the standard error of the estimate)

If we were to increase the number of students in the original random sample, a different regression line would be derived. The standard error of the estimate for that regression line would be correspondingly smaller, and the resulting confidence interval correspondingly narrower. That is, a larger random sample reduces the risk and uncertainty in our predictions and estimates.
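
The interval construction can be sketched directly, using the rounded values from the lesson:

```python
# 95% confidence interval around the predicted GPA, using the lesson's
# rule of thumb: predicted value plus or minus two standard errors.
predicted = 0.056 * 23 + 1.235   # 2.523
std_error = 0.12                 # standard error of the estimate (rounded)

margin = 2 * std_error           # 0.24
lower = predicted - margin
upper = predicted + margin
print(round(lower, 3), round(upper, 3))  # 2.283 2.763
```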

Automobile Accident Rates

Next, we will illustrate negative linear correlation with the following example. Assume that an insurance company wants to create a model of automobile accident rates to assist in its decision making regarding the insurance premiums to be charged to younger drivers in the coming year. It is a challenge to predict the future on the basis of past data, but the business of insurance is that of risk management. If insurance rates are set too high, the insurance company will lose business because it is not competitive in the marketplace. If insurance rates are set too low, the insurance company will lose money because insurance claims from its policyholders will not be covered by the premiums collected. So insurance companies must collect and analyze data, and regression analysis is one of the techniques used in that analysis.

Assume that data is gathered as follows, where X is the age in years of automobile drivers and Y is the reported accident rate per 1000 drivers in motor vehicle records of the recent past:

Age in Years (X)   Accident Rate (Y)
      16                 96
      18                 93
      20                 84
      22                 79
      24                 75
      26                 67
      28                 55
      30                 46

After entering the above data into our calculator via the STAT EDIT function, we execute the LinReg(aX+b) function from the STAT CALC menu:

LinReg
  y = aX + b
  a = -3.541666667
  b = 155.8333333
  r² = 0.9738509233
  r = -0.9868388538

In this example, we will retain two digits of decimal precision to obtain the following linear equation of the regression line for this data:

y = -3.54X + 155.83

This equation is the line-of-best-fit which serves as a model of the relationship between driver age and driver accident rate. The correlation coefficient is:

r = -0.9868388538

Note that the value of this correlation coefficient is negative, and that it is close to -1. So we have evidence of strong negative linear correlation. The scatter plot diagram and the regression line are displayed on the following graph.

[Scatter plot with regression line: Age in Years (x-axis, 15 to 30) versus Accident Rate (y-axis, 0 to 120)]
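
As with the first example, the calculator's output can be reproduced by hand. This Python sketch, added for illustration, applies the same least-squares and correlation formulas to the accident-rate data:

```python
import math

# Least-squares fit and correlation for the driver-age / accident-rate
# sample, a sketch reproducing the calculator's LinReg output.
age = [16, 18, 20, 22, 24, 26, 28, 30]
rate = [96, 93, 84, 79, 75, 67, 55, 46]
n = len(age)

sx, sy = sum(age), sum(rate)
sxy = sum(x * y for x, y in zip(age, rate))
sxx = sum(x * x for x in age)
syy = sum(y * y for y in rate)

a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope (negative here)
b = (sy - a * sx) / n                           # intercept
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(round(a, 2), round(b, 2), round(r, 4))  # -3.54 155.83 -0.9868
```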

If we enter the equation of the regression line into our calculator, we obtain the following table of predicted accident rates:

Age in Years (X)   Predicted Accident Rate
      16                 99.19
      18                 92.11
      20                 85.03
      22                 77.95
      24                 70.87
      26                 63.79
      28                 56.71
      30                 49.63

Next, we calculate the standard error of the estimate from the following data:

Age in Years   Actual Accident Rate   Predicted Accident Rate   Residual   Square of Residual
     16               96                     99.19               -3.19         10.176
     18               93                     92.11                0.89          0.792
     20               84                     85.03               -1.03          1.061
     22               79                     77.95                1.05          1.103
     24               75                     70.87                4.13         17.057
     26               67                     63.79                3.21         10.304
     28               55                     56.71               -1.71          2.924
     30               46                     49.63               -3.63         13.177

The residual is the difference between the actual accident rate and the predicted accident rate. For any given driver age, it measures the vertical gap between the actual accident rate on the scatter plot diagram and the predicted accident rate on the regression line.

Next, we obtain the average value of the eight numbers in the last column of this table, that is, the sum of these numbers divided by eight:

56.594 / 8 = 7.07425

The standard error of the estimate is the square root of this average:

Standard Error of the Estimate = √7.07425 ≈ 2.66

The regression line can be used to predict the accident rate in the coming year. For example, for a driver of age x = 19 years, the predicted accident rate in the coming year would be:

y = -3.54(19) + 155.83 = 88.57

To calculate a confidence interval around this prediction, we multiply the standard error of the estimate by 2:

2 × 2.66 = 5.32

The confidence interval for the predicted accident rate is determined as follows. For drivers of age 19 years, there is a 95% likelihood that the accident rate in the coming year will be in the following interval:

88.57 - 5.32 = 83.25   (predicted value minus two times the standard error of the estimate)
88.57 + 5.32 = 93.89   (predicted value plus two times the standard error of the estimate)
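
The full standard-error and confidence-interval computation for this second example can be sketched as follows, again using the lesson's conventions (rounded line y = -3.54x + 155.83, squared residuals averaged over all n = 8 points):

```python
import math

# Standard error and 95% prediction interval for the accident-rate model.
age = [16, 18, 20, 22, 24, 26, 28, 30]
rate = [96, 93, 84, 79, 75, 67, 55, 46]

# Predictions from the rounded regression line, then residuals.
predicted = [-3.54 * x + 155.83 for x in age]
residuals = [actual - pred for actual, pred in zip(rate, predicted)]

# Square root of the average squared residual, as in the lesson.
std_error = math.sqrt(sum(res ** 2 for res in residuals) / len(residuals))
print(round(std_error, 2))  # 2.66

# Prediction for a 19-year-old driver, with a two-standard-error margin.
pred_19 = -3.54 * 19 + 155.83
margin = 2 * round(std_error, 2)
print(round(pred_19 - margin, 2), round(pred_19 + margin, 2))  # 83.25 93.89
```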