SIMPLE LINEAR REGRESSION


Simple Linear Regression and Correlation

Introduction

Previously, our attention has been focused on one variable, which we designated by x. Frequently, it is desirable to learn something about the relationship between two (or more) variables. For example, we might be interested in studying the relationship between
o cholesterol level and age;
o blood pressure and age;
o height and weight;
o the amount of exercise and heart rate;
o the concentration of an injected drug and heart rate;
o the consumption level of some nutrient and weight gain.

The nature and strength of the relationship between two variables may be examined by regression and correlation analyses, two related statistical techniques that serve different purposes.

Regression is used to discover the probable form of the relationship between two variables x and y by finding an appropriate equation. The ultimate objective of this method of analysis is usually to predict or estimate the value of one variable corresponding to a given value of another variable, i.e. to predict or estimate the value of y for a given value of x.

Correlation analysis, on the other hand, is concerned with measuring how strong the relationship between the two variables x and y is, i.e. the degree of correlation between the two variables.

SIMPLE LINEAR REGRESSION

In simple linear regression the variable x is usually referred to as the independent variable, since frequently it is controlled by the investigator; that is, values of x may be selected by the investigator and, corresponding to each preselected value of x, one or more values of y are obtained. The other variable, y, is accordingly called the dependent variable, and we speak of the regression of y on x. In the above examples, the investigator could control the age but not the cholesterol level, the weight but not the height, the concentration of injected drug but not the heart rate, and so on.

We assume that for each value of x there is a whole population of y values which is normally distributed, and that all of these populations have equal variances.
In simple linear regression the object of the researcher's interest is the regression equation that describes the true relationship between the dependent variable y and the independent variable x.

Scatter diagram

A first step that is usually useful in studying the relationship between two variables is to prepare a scatter diagram of the data. The points are plotted by assigning values of the independent variable x to the horizontal axis and values of the dependent variable y to the vertical axis. The pattern made by the points plotted on the scatter diagram usually suggests the basic nature and the strength of the relationship between the two variables.

Example

Relationship between x and optical density (y)

x      Optical density (y)
3.0    0.10
4.0    0.20
4.5    0.25
5.0    0.32
5.5    0.33
6.0    0.35
6.5    0.47
7.0    0.49
7.5    0.53

[Figure: scatter diagram of optical density (vertical axis) against x (horizontal axis)]

In our example we can see, in general, that as x increases the optical density also increases, so that the two variables have a positive relationship.

The least-square line

We can also see that the points seem to be scattered around an invisible line which would describe the relationship between x and y. These impressions suggest that the relationship between the two variables may be described by a straight line crossing the y-axis near the origin and making approximately a 45 degree angle with the x-axis.

Thinking Challenge

It looks as if this line would be easy to draw by hand, but it is doubtful that the lines drawn by any two people would be exactly the same. In other words, for every person drawing such a line by eye, or freehand, we would expect a different line. Which line best describes the relationship between the variables? What is needed for obtaining the desired line?

[Figure: two copies of the scatter diagram of optical density against x]

Answer

If the scatter diagram has a linear trend, we need a mathematical way to obtain the best line through the data. We need to employ a method known as the method of least squares for obtaining the desired line, and the resulting line is called the least-square line. The reason for calling the method by this name will be explained in the discussion that follows.

Equation for straight line (Linear Equation)

Now, recall from algebra that the general equation for a straight line is given by

$$y = a + bx$$

where y is a value on the vertical axis, x is a value on the horizontal axis, a is the point where the line crosses the vertical axis (referred to as the y-intercept), and b shows the amount by which y changes for each unit change in x (referred to as the slope of the line).

[Figure: the line $y = a + bx$, with slope b = (change in y) / (change in x) and intercept a]

To draw a line based on the equation, we need the numerical values of the constants a and b. Given these constants, we may substitute various values of x into the equation $y = a + bx$ to obtain corresponding values of y. The resulting points may then be plotted.

Computation

Finding the b-value

$$b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$$

$$b = \frac{(9)(18.20) - (49)(3.04)}{(9)(284) - (49)^2} = \frac{14.84}{155} = 0.095742 \approx 0.0957$$

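Finding the y-intercept

$$a = \bar{y} - b\bar{x}$$

where $\bar{y}$ = mean of the y values and $\bar{x}$ = mean of the x values:

$$\bar{y} = \frac{3.04}{9} = 0.3378, \qquad \bar{x} = \frac{49}{9} = 5.444$$

$$a = 0.33778 - (0.095742)(5.44444) = -0.18348 \approx -0.1835$$

x       Optical density (y)   x^2      y^2      xy
3.0     0.10                  9.00     0.0100   0.300
4.0     0.20                  16.00    0.0400   0.800
4.5     0.25                  20.25    0.0625   1.125
5.0     0.32                  25.00    0.1024   1.600
5.5     0.33                  30.25    0.1089   1.815
6.0     0.35                  36.00    0.1225   2.100
6.5     0.47                  42.25    0.2209   3.055
7.0     0.49                  49.00    0.2401   3.430
7.5     0.53                  56.25    0.2809   3.975
Total   Σx = 49, Σy = 3.04, Σx^2 = 284, Σy^2 = 1.1882, Σxy = 18.20
Mean    x̄ = 5.444, ȳ = 0.3378

Alternatively,

$$a = \frac{\sum y - b\sum x}{n}$$

The equation for the least squares line is:

$$\hat{y} = a + bx = -0.1835 + 0.0957x = 0.0957x - 0.1835$$

Note that we use the symbol $\hat{y}$ because this value is computed from the equation and is not an observed value of y. Now, we can substitute various values of x into the equation to obtain corresponding values of $\hat{y}$. The resulting points may be plotted.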
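The slope and intercept computation above can be sketched in Python. This is a minimal illustration using only the computational formulas from the text; the function name `least_squares_line` is a hypothetical choice, and the data are the nine (x, y) pairs of the worked example.

```python
# Data from the worked example: x values and optical density y.
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def least_squares_line(x, y):
    """Return (a, b) for the least-squares line y_hat = a + b*x.

    Hypothetical helper: applies the computational formulas
    b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) and a = y_bar - b*x_bar.
    """
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares_line(x, y)
print(f"b = {b:.4f}, a = {a:.4f}")  # b = 0.0957, a = -0.1835
```

Carrying b at full precision before computing a avoids the small discrepancies that appear when a rounded slope is substituted into the intercept formula.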

Example: Predicting y for a given x using the regression equation

Choose a value for x (within the range of x values): x = 6.8.
Substitute the selected x in the regression equation: $\hat{y} = 0.0957(6.8) - 0.1835$.
Determine the corresponding value of $\hat{y}$: $\hat{y} = 0.0957x - 0.1835 \approx 0.467$.

According to the equation, an x of 6.8 would have an optical density of about 0.467.

Drawing the least-squares line

Since any two such coordinates determine a straight line, we may select any two values in the range of x, compute the two corresponding $\hat{y}$ values, locate them on a graph, and connect them with a straight line to obtain the line corresponding to the equation.

The following point will always be on the least squares line: $(\bar{x}, \bar{y})$. Use 5.444 and 0.3378, the averages of the x's and the y's, respectively.

Try x = 4. Compute: $\hat{y} = 0.0957(4) - 0.1835 = 0.1993$.

Sketching the line using the points (5.444, 0.3378) and (4, 0.1993)

[Figure: scatter diagram of optical density against x with the least-squares line $\hat{y} = 0.0957x - 0.1835$]

What we have obtained is what is called the best line for describing the relationship between our two variables. By what criterion is it considered best? Before the criterion is stated, let us examine the figure obtained. Note that the least squares line does not pass through most of the observed points that are plotted on the scatter diagram. In other words, the observed points deviate from the line by varying amounts.
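The prediction step can be sketched as follows. The coefficients are the rounded values of the fitted line; the function name `predict` and the range guard are illustrative assumptions, added because the text advises choosing x within the range of the observed x values (3 to 7.5).

```python
# Rounded coefficients of the fitted line y_hat = A + B*x.
A = -0.1835
B = 0.0957

# Range of the observed x values; predictions outside it are extrapolations.
X_MIN, X_MAX = 3.0, 7.5


def predict(x_value):
    """Return y_hat for x_value, refusing to extrapolate (hypothetical guard)."""
    if not X_MIN <= x_value <= X_MAX:
        raise ValueError(f"x = {x_value} is outside the observed range")
    return A + B * x_value


y_hat_68 = predict(6.8)   # about 0.467
y_hat_4 = predict(4)      # about 0.199
y_hat_mean = predict(5.444)  # close to the mean of y, about 0.338

print(round(y_hat_68, 3), round(y_hat_4, 3), round(y_hat_mean, 3))
```

The last call illustrates that the line passes (up to rounding of the coefficients) through the point of means.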

[Figure: scatter diagram with the least-squares line, showing the vertical deviations of several observed points from the line]

The line that we have drawn is best in this sense: the sum of the squared vertical deviations of the observed data points $(y_i)$ from the least-squares line is smaller than the sum of the squared vertical deviations of the observed data points from any other line. In other words, if we square the vertical distance from each observed point $(y_i)$ to the least-squares line and add these squared values for all points, the resulting total will be smaller than the similarly computed total for any other line that can be drawn through the points. For this reason the line we have drawn is called the least-squares line.

Evaluating the strength of the regression equation

One way to evaluate the strength of the regression equation in describing the relationship between two variables is to compare the scatter of the points about the regression line with the scatter about $\bar{y}$, the mean of the values of y. To do that, draw through the points a line that intersects the y-axis at $\bar{y}$ and is parallel to the x-axis. By doing so, we may obtain a visual impression of the relative magnitudes of the scatter of the points about this line and about the regression line. This has been done in the next figure.

[Figure: scatter diagram with the regression line $\hat{y} = 0.0957x - 0.1835$ and the horizontal line $\bar{y} = 0.3378$]

It appears from the figure that the scatter of the points about the regression line is much less than the scatter of the points about the $\bar{y}$ line. But the situation may not always be this clear-cut, so some sort of calculation to evaluate the strength of the regression equation is necessary, that is, the coefficient of determination $r^2$.
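The least-squares criterion, and the comparison of scatter about the regression line with scatter about the mean line, can be checked numerically. The sketch below is illustrative only: the helper `sum_sq_dev` and the three alternative lines are hypothetical choices, picked simply to show that other lines give a larger sum of squared vertical deviations.

```python
# Data from the worked example: x values and optical density y.
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def sum_sq_dev(a, b):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Least-squares coefficients (deviation form of the same formulas).
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

ss_least_squares = sum_sq_dev(a, b)
# The horizontal line through the mean of y has intercept y_bar and slope 0.
ss_mean_line = sum_sq_dev(y_bar, 0.0)

# Hypothetical alternative lines, e.g. lines someone might draw by eye.
alternatives = [(a + 0.02, b), (a, b + 0.005), (-0.20, 0.10)]
ss_alternatives = [sum_sq_dev(a_alt, b_alt) for a_alt, b_alt in alternatives]

print(f"least-squares line: {ss_least_squares:.5f}")
print(f"mean line:          {ss_mean_line:.5f}")
for (a_alt, b_alt), ss in zip(alternatives, ss_alternatives):
    print(f"a = {a_alt:.4f}, b = {b_alt:.4f}: {ss:.5f}")
```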

The logic behind the computation of the "coefficient of determination"

We begin by considering the point corresponding to an observed value, $y_i$, and by measuring its vertical distance from the $\bar{y}$ line. We call this the total deviation, $(y_i - \bar{y})$.

If we measure the vertical distance from the regression line to the $\bar{y}$ line, we obtain $(\hat{y}_i - \bar{y})$, which is called the explained deviation, since it shows by how much the total deviation is reduced when the regression line is fitted to the points.

Finally, we measure the vertical distance of the observed point from the regression line to obtain $(y_i - \hat{y}_i)$, which is called the unexplained deviation, since it represents the portion of the total deviation not explained or accounted for by the introduction of the regression line.

[Figure: scatter diagram with the regression line $\hat{y} = 0.0957x - 0.1835$ and the line $\bar{y} = 0.3378$, marking the total deviation $(y_i - \bar{y})$, the explained deviation and the unexplained deviation for one point]

It is seen, then, that the total deviation for a particular $y_i$ is equal to the sum of the explained and unexplained deviations:

$$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$

total deviation = explained deviation + unexplained deviation

If we measure these deviations for each value of $y_i$ and $\hat{y}_i$, square each deviation, and add up the squared deviations, we have

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

total sum of squares (SST) = explained sum of squares (SSR) + unexplained sum of squares (SSE)

We may express the relationship between the three sums of squares as

SST = SSR + SSE
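The step from the deviations to the sums of squares is not automatic, because squaring a sum produces a cross-product term. A short sketch of why that term vanishes for the least-squares line is given below; it assumes the standard least-squares conditions $\sum e_i = 0$ and $\sum x_i e_i = 0$ for the residuals $e_i = y_i - \hat{y}_i$, which are not stated in the text above.

```latex
\begin{align*}
\sum (y_i - \bar{y})^2
  &= \sum \bigl[(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\bigr]^2 \\
  &= \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2
     + 2\sum (\hat{y}_i - \bar{y})(y_i - \hat{y}_i).
\end{align*}
% Since \hat{y}_i = a + b x_i and \bar{y} = a + b\bar{x},
% we have \hat{y}_i - \bar{y} = b(x_i - \bar{x}). With e_i = y_i - \hat{y}_i:
\begin{align*}
\sum (\hat{y}_i - \bar{y})\, e_i
  = b \sum (x_i - \bar{x})\, e_i
  = b \Bigl(\sum x_i e_i - \bar{x} \sum e_i\Bigr)
  = 0,
\end{align*}
% so the cross term drops out and SST = SSR + SSE.
```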

It is intuitively appealing to speculate that if a regression equation does a good job of describing the relationship between two variables, the explained sum of squares should constitute a large proportion of the total sum of squares. The next figure illustrates a case in which the explained deviation constitutes a small proportion of the total deviation, as compared with the previous figure.

[Figure: scatter diagram marking the total deviation $(y_i - \bar{y})$, a small explained deviation and a large unexplained deviation, with the line $\bar{y} = 0.3378$]

It would be of interest, then, to determine the magnitude of this proportion by computing the ratio of the explained sum of squares to the total sum of squares. This is exactly what is done in evaluating a regression equation based on sample data, and the result is called the sample coefficient of determination, $r^2$. In other words,

$$r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{\text{Explained sum of squares}}{\text{Total sum of squares}} = \frac{SSR}{SST} = \frac{0.15787}{0.16136} = 0.9784$$

[Table: for each observation, the values of x, y, $\hat{y}$, the deviations $(y - \bar{y})$, $(y - \hat{y})$, $(\hat{y} - \bar{y})$ and their squares; the individual entries are not legible]

Totals for the optical density data, computed with the unrounded coefficients:

SST = 0.16136
SSE = 0.00349
SSR = 0.15787
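The three sums of squares and the coefficient of determination can be reproduced with a few lines of Python. This is a sketch on the example data, using unrounded coefficients; the variable names are illustrative.

```python
# Data from the worked example: x values and optical density y.
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept (unrounded).
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Fitted values on the regression line.
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained

r_squared = ssr / sst

print(f"SST = {sst:.5f}, SSR = {ssr:.5f}, SSE = {sse:.5f}")
print(f"r^2 = {r_squared:.4f}")
```

Using coefficients rounded to four decimals instead gives slightly different values for SSR and SSE, and the identity SST = SSR + SSE then holds only approximately.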

[Figure: scatter diagram of optical density against x with the fitted line $\hat{y} = 0.0957x - 0.1835$ and $R^2 = 0.9784$]

Thus, the coefficient of determination measures the closeness of fit of the regression equation to the observed values of y.

Interpretation of $r^2$

If $r^2 = 0.9784$, approximately 98 percent of the variation in optical density (y) is explained by the linear relationship with x. Less than three percent is explained by other causes.

From the table, we see that when the quantities $(y_i - \hat{y}_i)$, the vertical distances of the observed values of y from the equation, are small, the unexplained sum of squares (SSE) is small. This leads to a large explained sum of squares, which leads in turn to a large value of $r^2$. This is illustrated in the figures.

[Figure: small unexplained deviation, large explained deviation]

In this figure, we see that the observations all lie close to the regression line, and we would expect $r^2$ to be large.

[Figure: large unexplained deviation, small explained deviation]

In this figure, we see that the observations are widely scattered about the regression line, and we would expect $r^2$ to be small.

[Figure: explained deviation = total deviation]

In this figure, we see that all the observations fall on the regression line, and we would expect the largest value of $r^2$, which is equal to 1.

[Figure: explained deviation = 0]

In this figure, we see that the regression line and the line drawn through $\bar{y}$ coincide, and we would expect the lowest value of $r^2$, which is close to 0.

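[Figure: Coefficient of Determination Examples; scatter plots of Y against X with $r^2$ values ranging from 1 down to 0]

CORRELATION

Correlation coefficient r
1. Measures the strength of the linear relationship between the two variables represented as x and y.
2. Coefficient of correlation values range from -1 to +1.

An alternative formula for computing the coefficient of correlation, r:

$$r = \sqrt{\text{Coefficient of Determination}}$$

where r takes the sign of the slope b. Equivalently, in computational form,

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Computation Table

x       Optical density (y)   xy       x^2      y^2
3.0     0.10                  0.300    9.00     0.0100
4.0     0.20                  0.800    16.00    0.0400
4.5     0.25                  1.125    20.25    0.0625
5.0     0.32                  1.600    25.00    0.1024
5.5     0.33                  1.815    30.25    0.1089
6.0     0.35                  2.100    36.00    0.1225
6.5     0.47                  3.055    42.25    0.2209
7.0     0.49                  3.430    49.00    0.2401
7.5     0.53                  3.975    56.25    0.2809
Total   Σx = 49, Σy = 3.04, Σxy = 18.20, Σx^2 = 284, Σy^2 = 1.1882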

$$r = \frac{(9)(18.20) - (49)(3.04)}{\sqrt{\left[(9)(284) - (49)^2\right]\left[(9)(1.1882) - (3.04)^2\right]}} = \frac{14.84}{\sqrt{(155)(1.4522)}} = 0.9891$$

r = 0.989, or r ≈ 0.99.

The statistic r has the following properties:
1. r measures the extent of linear association between two variables.
2. r has a value between -1 and 1.
3. r = 1 if and only if all the observations are on a straight line with positive slope.
4. r = -1 if and only if all the observations are on a straight line with negative slope.
5. r tends to be close to zero if there is no linear association between x and y.

[Figure: Coefficient of Correlation Values; a scale running from -1.0 (perfect negative correlation) through -0.5, 0 (no correlation) and +0.5 to +1.0 (perfect positive correlation), with the degree of negative correlation increasing toward -1.0 and the degree of positive correlation increasing toward +1.0]

Coefficient of Correlation Examples

Although there is no fixed rule for interpreting the strength of a correlation, we will say that the correlation is
Strong if |r| ≥ 0.8
Moderate if 0.5 < |r| < 0.8
Weak if |r| ≤ 0.5
We will also add the words positive or negative to indicate the type of correlation.

Warning

The correlation coefficient (r) measures the strength of the relationship between two variables. Just because two variables are related does not imply that there is a cause-and-effect relationship between them.
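The computational formula for r can be checked with a short Python sketch on the example data. The function name `pearson_r` is a hypothetical choice; only the standard-library `math` module is used.

```python
import math

# Data from the worked example: x values and optical density y.
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def pearson_r(x, y):
    """Correlation coefficient from the computational (raw sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Squaring the result recovers the coefficient of determination, which is a quick consistency check between the two computations.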

Figure: Scatter plots illustrating how the correlation coefficient, r, is a measure of the linear association between two variables.

Spearman rank correlation

Spearman's rank correlation is a nonparametric correlation coefficient test. To perform this procedure, we first arrange the x values from smallest (rank = 1) to largest (rank = n); let $R_i$ be the rank of value $x_i$. Similarly, we arrange the y values from smallest to largest and assign a rank from 1 to n to each value; let $S_i$ be the rank of value $y_i$. Tied values receive the average of the ranks they occupy. Find the difference $R_i - S_i$ for each pair, and then square it. The last step is to calculate Spearman's correlation coefficient rho ($\rho$) from

$$\rho = 1 - \frac{6\sum (R_i - S_i)^2}{n(n^2 - 1)}$$

The Spearman rank correlation may be used for qualitative variables which can be ordered. For example, consider a variable which gives an opinion about something. Answer choices such as dislike very much, dislike, no opinion, like, and like very much have a logical order to them as they are listed. An example of a qualitative variable which cannot be ordered is eye color, which has values such as blue, green, and brown. There is no way to logically order such categories, and so we could not use the ranking procedure for them. Note that Pearson correlation is not appropriate in either of these cases.

Example

Consider again the data of the above Example. The next table contains the preliminary calculations necessary to find the Spearman rank correlation. From the table we find that

$$\rho = 1 - \frac{6\sum (R_i - S_i)^2}{n(n^2 - 1)} = 1 - \frac{6(127)}{15(224)} = 1 - \frac{762}{3360} = 1 - 0.2268 = 0.773$$

which is not the same as what is obtained for the Pearson correlation. This difference is due to the presence of extreme observations. When such extreme values are few or not found, the results of the two tests will be very close to each other.

Age (x)   Rank R   SBP (y)   Rank S   R-S     (R-S)^2
42        2.5      130       4        -1.5    2.25
46        4        115       2        2       4
42        2.5      148       5        -2.5    6.25
71        8        100       1        7       49
80        12.5     156       9.5      3       9
74        10       162       13.5     -3.5    12.25
70        7        151       7        0       0
80        12.5     156       9.5      3       9
85        15       162       13.5     1.5     2.25
72        9        158       11       -2      4
64        6        155       8        -2      4
81        14       160       12       2       4
41        1        125       3        -2      4
61        5        150       6        -1      1
75        11       165       15       -4      16
Total                                         127
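The ranking procedure and the formula for rho can be sketched in Python as follows. The helper names `average_ranks` and `spearman_rho` are hypothetical, and the data are the age and SBP values of the table above. Tied values share the average of the ranks they occupy, as in the table; with ties present, the simple formula used in the text is an approximation to the rank correlation.

```python
# Age (x) and SBP (y) for the 15 subjects in the table.
age = [42, 46, 42, 71, 80, 74, 70, 80, 85, 72, 64, 81, 41, 61, 75]
sbp = [130, 115, 148, 100, 156, 162, 151, 156, 162, 158, 155, 160, 125, 150, 165]


def average_ranks(values):
    """Rank values from smallest (1) to largest (n); ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho from the sum of squared rank differences."""
    n = len(x)
    r_ranks, s_ranks = average_ranks(x), average_ranks(y)
    sum_d2 = sum((r - s) ** 2 for r, s in zip(r_ranks, s_ranks))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


rho, sum_d2 = spearman_rho(age, sbp)
print(f"sum of squared rank differences = {sum_d2}")
print(f"rho = {rho:.3f}")
```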