Chapter 2 Simple Linear Regression


2.1 Introduction and Least Squares Estimates

Regression analysis is a method for investigating the functional relationship among variables. In this chapter we consider problems involving modeling the relationship between two variables. These problems are commonly referred to as simple linear regression or straight-line regression. In later chapters we shall consider problems involving modeling the relationship between three or more variables. In particular, we next consider problems involving modeling the relationship between two variables as a straight line, that is, when Y is modeled as a linear function of X.

Example: A regression model for the timing of production runs

We shall consider the following example, taken from Foster, Stine and Waterman (1997, pages 191–199), throughout this chapter. The original data are in the form of the time taken (in minutes) for a production run, Y, and the number of items produced, X, for 20 randomly selected orders as supervised by three managers. At this stage we shall only consider the data for one of the managers (see Table 2.1 and Figure 2.1). We wish to develop an equation to model the relationship between Y, the run time, and X, the run size. A scatter plot of the data like that given in Figure 2.1 should ALWAYS be drawn to obtain an idea of the sort of relationship that exists between two variables (e.g., linear, quadratic, exponential, etc.).

2.1.1 Simple Linear Regression Models

When data are collected in pairs the standard notation used to designate this is: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), where x₁ denotes the first value of the so-called X-variable and y₁ denotes the first value of the so-called Y-variable. The X-variable is called the explanatory or predictor variable, while the Y-variable is called the response variable or the dependent variable. The X-variable often has a different status to the Y-variable in that:

6 Smple Lear Regresso Table. Producto data (producto.txt) Case Ru tme Ru sze Case Ru tme Ru sze 95 75 337 5 89 68 58 3 43 344 3 7 46 4 6 88 4 5 77 5 85 4 5 69 3 6 3 338 6 5 7 7 34 7 7 47 63 8 66 73 8 3 337 9 53 84 9 8 46 96 77 7 68 4 Ru Tme 6 Fgure. A scatter plot of the producto data 5 3 Ru Sze It ca be thought of as a potetal predctor of the Y-varable Its value ca sometmes be chose by the perso udertakg the study Smple lear regresso s typcally used to model the relatoshp betwee two varables Y ad X so that gve a specfc value of X, that s, X = x, we ca predct the value of Y. Mathematcally, the regresso of a radom varable Y o a radom varable X s E(Y X = x), the expected value of Y whe X takes the specfc value x. For example, f X = Day of the week ad Y = Sales at a gve compay, the the regresso of Y o X represets the mea (or average) sales o a gve day. The regresso of Y o X s lear f

Simple linear regression is typically used to model the relationship between two variables Y and X so that, given a specific value of X, that is, X = x, we can predict the value of Y. Mathematically, the regression of a random variable Y on a random variable X is E(Y | X = x), the expected value of Y when X takes the specific value x. For example, if X = Day of the week and Y = Sales at a given company, then the regression of Y on X represents the mean (or average) sales on a given day. The regression of Y on X is linear if

E(Y | X = x) = β₀ + β₁x   (2.1)

where the unknown parameters β₀ and β₁ determine the intercept and the slope of a specific straight line, respectively. Suppose that Y₁, Y₂, ..., Yₙ are independent realizations of the random variable Y that are observed at the values x₁, x₂, ..., xₙ of a random variable X. If the regression of Y on X is linear, then for i = 1, 2, ..., n

Yᵢ = E(Y | X = xᵢ) + eᵢ = β₀ + β₁xᵢ + eᵢ

where eᵢ is the random error in Yᵢ and is such that E(e | X) = 0. The random error term is there because there will almost certainly be some variation in Y due strictly to random phenomena that cannot be predicted or explained. In other words, all unexplained variation is called random error. Thus, the random error term does not depend on x, nor does it contain any information about Y (otherwise it would be a systematic error).

We shall begin by assuming that

Var(Y | X = x) = σ².   (2.2)

In Chapter 4 we shall see how this last assumption can be relaxed.

Estimating the population slope and intercept

Suppose, for example, that X = height and Y = weight of a randomly selected individual from some population. Then for a straight-line regression model the mean weight of individuals of a given height would be a linear function of that height. In practice, we usually have a sample of data instead of the whole population. The slope β₁ and intercept β₀ are unknown, since these are the values for the whole population. Thus, we wish to use the given data to estimate the slope and the intercept. This can be achieved by finding the equation of the line which "best" fits our data, that is, by choosing b₀ and b₁ such that ŷᵢ = b₀ + b₁xᵢ is as close as possible to yᵢ. Here the notation ŷᵢ is used to denote the value of the line of best fit in order to distinguish it from the observed values of y, that is, yᵢ. We shall refer to ŷᵢ as the ith predicted value or the fitted value of yᵢ.

Residuals

In practice, we wish to minimize the difference between the actual value of y (yᵢ) and the predicted value of y (ŷᵢ). This difference is called the residual, êᵢ, that is, êᵢ = yᵢ − ŷᵢ. Figure 2.2 shows a hypothetical situation based on six data points. Marked on this plot is a line of best fit, ŷᵢ, along with the residuals.

Least squares line of best fit

A very popular method of choosing b₀ and b₁ is called the method of least squares. As the name suggests, b₀ and b₁ are chosen to minimize the sum of squared residuals (or residual sum of squares [RSS]),

Figure 2.2 A scatter plot of data with a line of best fit and the residuals identified

RSS = Σᵢ₌₁ⁿ êᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ)².

For RSS to be a minimum with respect to b₀ and b₁ we require

∂RSS/∂b₀ = −2 Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ) = 0

and

∂RSS/∂b₁ = −2 Σᵢ₌₁ⁿ xᵢ(yᵢ − b₀ − b₁xᵢ) = 0.

Rearranging terms in these last two equations gives

Σᵢ₌₁ⁿ yᵢ = b₀n + b₁ Σᵢ₌₁ⁿ xᵢ

and

Σᵢ₌₁ⁿ xᵢyᵢ = b₀ Σᵢ₌₁ⁿ xᵢ + b₁ Σᵢ₌₁ⁿ xᵢ².

These last two equations are called the normal equations. Solving these equations for b₀ and b₁ gives the so-called least squares estimates of the intercept

β̂₀ = ȳ − β̂₁x̄   (2.3)

and the slope

β̂₁ = (Σᵢ₌₁ⁿ xᵢyᵢ − n x̄ ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n x̄²) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² = SXY/SXX.   (2.4)

Regression Output from R

The least squares estimates for the production data were calculated using R, giving the following results:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
RunSize       0.25924    0.03714    6.98 1.61e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.25 on 18 degrees of freedom
Multiple R-Squared: 0.7302,  Adjusted R-squared: 0.7152
F-statistic: 48.72 on 1 and 18 DF,  p-value: 1.615e-06

The least squares line of best fit for the production data

Figure 2.3 shows a scatter plot of the production data with the least squares line of best fit. The equation of the least squares line of best fit is

ŷ = 149.7 + 0.26x.

Let us look at the results that we have obtained from the line of best fit in Figure 2.3. The intercept in Figure 2.3 is 149.7, which is where the line of best fit crosses the run time axis. The slope of the line in Figure 2.3 is 0.26. Thus, we say that each additional unit to be produced is predicted to add 0.26 minutes to the run time. The intercept in the model has the following interpretation: for any production run, the average set-up time is 149.7 minutes.
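The least squares estimates (2.3) and (2.4) can be computed directly from their formulas. The following sketch (reusing the production data frame read in above) does this and then checks the answer against R's built-in lm() function:

x <- production$RunSize
y <- production$RunTime
SXX <- sum((x - mean(x))^2)                # sum of squared deviations of x
SXY <- sum((x - mean(x)) * (y - mean(y)))  # cross-product sum
b1 <- SXY / SXX                            # slope estimate, equation (2.4)
b0 <- mean(y) - b1 * mean(x)               # intercept estimate, equation (2.3)
c(b0, b1)                                  # 149.74770 0.25924

m <- lm(RunTime ~ RunSize, data = production)
summary(m)                                 # reproduces the output shown above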

Figure 2.3 A plot of the production data with the least squares line of best fit

Estimating the variance of the random error term

Consider the linear regression model with constant variance given by (2.1) and (2.2). In this case,

Yᵢ = β₀ + β₁xᵢ + eᵢ  (i = 1, 2, ..., n)

where the random error eᵢ has mean 0 and variance σ². We wish to estimate σ² = Var(e). Notice that

eᵢ = Yᵢ − (β₀ + β₁xᵢ) = Yᵢ − unknown regression line at xᵢ.

Since β₀ and β₁ are unknown, all we can do is estimate these errors by replacing β₀ and β₁ by their respective least squares estimates β̂₀ and β̂₁, giving the residuals

êᵢ = Yᵢ − (β̂₀ + β̂₁xᵢ) = Yᵢ − estimated regression line at xᵢ.

These residuals can be used to estimate σ². In fact it can be shown that

S² = RSS/(n − 2) = Σᵢ₌₁ⁿ êᵢ² / (n − 2)

is an unbiased estimate of σ². Two points to note are:
1. The residuals sum to zero, Σᵢ₌₁ⁿ êᵢ = 0, since the least squares estimates minimize RSS = Σᵢ₌₁ⁿ êᵢ².
2. The divisor in S² is n − 2 since we have estimated two parameters, namely β₀ and β₁.
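Continuing the sketch, the residuals, their sum, and the unbiased estimate S² of σ² can be verified numerically:

e.hat <- y - (b0 + b1 * x)        # residuals
sum(e.hat)                        # essentially zero (point 1 above)
RSS <- sum(e.hat^2)               # residual sum of squares
S2 <- RSS / (length(y) - 2)       # divisor n - 2 (point 2 above)
sqrt(S2)                          # 16.25, the residual standard error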

2.2 Inferences About the Slope and the Intercept

In this section, we shall develop methods for finding confidence intervals and for performing hypothesis tests about the slope and the intercept of the regression line.

2.2.1 Assumptions Necessary in Order to Make Inferences About the Regression Model

Throughout this section we shall make the following assumptions:
1. Y is related to x by the simple linear regression model Yᵢ = β₀ + β₁xᵢ + eᵢ (i = 1, ..., n), i.e., E(Y | X = xᵢ) = β₀ + β₁xᵢ.
2. The errors e₁, e₂, ..., eₙ are independent of each other.
3. The errors e₁, e₂, ..., eₙ have a common variance σ².
4. The errors are normally distributed with a mean of 0 and variance σ², that is, e | X ~ N(0, σ²).

Methods for checking these four assumptions will be considered in Chapter 3. In addition, since the regression model is conditional on X, we can assume that the values of the predictor variable, x₁, x₂, ..., xₙ, are known fixed constants.

2.2.2 Inferences About the Slope of the Regression Line

Recall from (2.4) that the least squares estimate of β₁ is given by

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² = SXY/SXX.

Since Σᵢ₌₁ⁿ (xᵢ − x̄) = 0, we find that

Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) = Σᵢ₌₁ⁿ (xᵢ − x̄)yᵢ − ȳ Σᵢ₌₁ⁿ (xᵢ − x̄) = Σᵢ₌₁ⁿ (xᵢ − x̄)yᵢ.

Thus, we can rewrite β̂₁ as

β̂₁ = Σᵢ₌₁ⁿ cᵢyᵢ  where  cᵢ = (xᵢ − x̄)/SXX.   (2.5)

We shall see that this version of β̂₁ will be used whenever we study its theoretical properties. Under the above assumptions, we shall show in Section 2.7 that

E(β̂₁ | X) = β₁   (2.6)

Var(β̂₁ | X) = σ²/SXX   (2.7)

β̂₁ | X ~ N(β₁, σ²/SXX)   (2.8)

Note from (2.7) that the variance of the least squares slope estimate decreases as SXX increases (i.e., as the variability in the x's increases). This is an important fact to note if the experimenter has control over the choice of the values of the X variable.

Standardizing (2.8) gives

Z = (β̂₁ − β₁) / (σ/√SXX) ~ N(0, 1).

If σ were known then we could use Z to test hypotheses and find confidence intervals for β₁. When σ is unknown (as is usually the case), replacing σ by S, the standard deviation of the residuals, results in

T = (β̂₁ − β₁) / (S/√SXX) = (β̂₁ − β₁) / se(β̂₁)

where se(β̂₁) = S/√SXX is the estimated standard error (se) of β̂₁, which is given directly by R. In the production example the X-variable is RunSize and so se(β̂₁) = 0.03714.

It can be shown that, under the above assumptions, T has a t-distribution with n − 2 degrees of freedom, that is,

T = (β̂₁ − β₁) / se(β̂₁) ~ t(n − 2).

Notice that the degrees of freedom satisfy the following formula:

degrees of freedom = sample size − number of mean parameters estimated.

In this case we are estimating two such parameters, namely, β₀ and β₁.

For testing the hypothesis H₀: β₁ = β₁⁰ the test statistic is

T = (β̂₁ − β₁⁰) / se(β̂₁) ~ t(n − 2) when H₀ is true.

R provides the value of T and the p-value associated with testing H₀: β₁ = 0 against Hₐ: β₁ ≠ 0 (i.e., for the choice β₁⁰ = 0). In the production example the X-variable is RunSize and T = 6.98, which results in a p-value less than 0.0001.

A 100(1 − α)% confidence interval for β₁, the slope of the regression line, is given by

(β̂₁ − t(α/2, n − 2) se(β̂₁), β̂₁ + t(α/2, n − 2) se(β̂₁))

where t(α/2, n − 2) is the 100(1 − α/2)th quantile of the t-distribution with n − 2 degrees of freedom. In the production example the X-variable is RunSize and β̂₁ = 0.25924, se(β̂₁) = 0.03714, t(0.025, n − 2 = 18) = 2.10092. Thus, a 95% confidence interval for β₁ is given by

(0.25924 ± 2.10092 × 0.03714) = (0.25924 ± 0.07803) = (0.181, 0.337).
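In R this interval can be obtained either from the formula or directly with confint(); a minimal sketch using the quantities computed earlier:

t.crit <- qt(0.975, df = 18)        # t(0.025, 18) = 2.10092
se.b1 <- sqrt(S2 / SXX)             # 0.03714
b1 + c(-1, 1) * t.crit * se.b1      # (0.181, 0.337)
confint(m, level = 0.95)            # both intervals, directly from R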

2.2.3 Inferences About the Intercept of the Regression Line

Recall from (2.3) that the least squares estimate of β₀ is given by

β̂₀ = ȳ − β̂₁x̄.

Under the assumptions given previously, we shall show in Section 2.7 that

E(β̂₀ | X) = β₀   (2.9)

Var(β̂₀ | X) = σ²(1/n + x̄²/SXX)   (2.10)

β̂₀ | X ~ N(β₀, σ²(1/n + x̄²/SXX))   (2.11)

Standardizing (2.11) gives

Z = (β̂₀ − β₀) / (σ√(1/n + x̄²/SXX)) ~ N(0, 1).

If σ were known then we could use Z to test hypotheses and find confidence intervals for β₀. When σ is unknown (as is usually the case), replacing σ by S results in

T = (β̂₀ − β₀) / se(β̂₀) = (β̂₀ − β₀) / (S√(1/n + x̄²/SXX)) ~ t(n − 2)

where se(β̂₀) = S√(1/n + x̄²/SXX) is the estimated standard error of β̂₀, which is given directly by R. In the production example the intercept is called Intercept and so se(β̂₀) = 8.32815.

For testing the hypothesis H₀: β₀ = β₀⁰ the test statistic is

T = (β̂₀ − β₀⁰) / se(β̂₀) ~ t(n − 2) when H₀ is true.

R provides the value of T and the p-value associated with testing H₀: β₀ = 0 against Hₐ: β₀ ≠ 0. In the production example the intercept is called Intercept and T = 17.98, which results in a p-value less than 0.0001.

A 100(1 − α)% confidence interval for β₀, the intercept of the regression line, is given by

(β̂₀ − t(α/2, n − 2) se(β̂₀), β̂₀ + t(α/2, n − 2) se(β̂₀))

where t(α/2, n − 2) is the 100(1 − α/2)th quantile of the t-distribution with n − 2 degrees of freedom. In the production example, β̂₀ = 149.7477, se(β̂₀) = 8.32815, t(0.025, n − 2 = 18) = 2.10092. Thus, a 95% confidence interval for β₀ is given by

(149.7477 ± 2.10092 × 8.32815) = (149.748 ± 17.497) = (132.3, 167.2).

Regression Output from R: 95% confidence intervals

                2.5 %   97.5 %
(Intercept)   132.251  167.244
RunSize         0.181    0.337

2.3 Confidence Intervals for the Population Regression Line

In this section we consider the problem of finding a confidence interval for the unknown population regression line at a given value of X, which we shall denote by x*. First, recall from (2.1) that the population regression line at X = x* is given by

E(Y | X = x*) = β₀ + β₁x*.

An estimator of this unknown quantity is the value of the estimated regression equation at X = x*, namely,

ŷ* = β̂₀ + β̂₁x*.

Under the assumptions stated previously, it can be shown that

E(ŷ*) = E(ŷ | X = x*) = β₀ + β₁x*   (2.12)

Var(ŷ*) = Var(ŷ | X = x*) = σ²(1/n + (x* − x̄)²/SXX)   (2.13)

ŷ* = ŷ | X = x* ~ N(β₀ + β₁x*, σ²(1/n + (x* − x̄)²/SXX))   (2.14)

Standardizing (2.14) gives

Z = (ŷ* − (β₀ + β₁x*)) / (σ√(1/n + (x* − x̄)²/SXX)) ~ N(0, 1).

Replacing σ by S results in

T = (ŷ* − (β₀ + β₁x*)) / (S√(1/n + (x* − x̄)²/SXX)) ~ t(n − 2).

A 100(1 − α)% confidence interval for E(Y | X = x*) = β₀ + β₁x*, the population regression line at X = x*, is given by

ŷ* ± t(α/2, n − 2) S √(1/n + (x* − x̄)²/SXX) = β̂₀ + β̂₁x* ± t(α/2, n − 2) S √(1/n + (x* − x̄)²/SXX)

where t(α/2, n − 2) is the 100(1 − α/2)th quantile of the t-distribution with n − 2 degrees of freedom.

2.4 Prediction Intervals for the Actual Value of Y

In this section we consider the problem of finding a prediction interval for the actual value of Y at x*, a given value of X.

Important Notes:
1. E(Y | X = x*), the expected value or average value of Y for a given value x* of X, is what one would expect Y to be in the long run when X = x*. E(Y | X = x*) is therefore a fixed but unknown quantity, whereas Y can take a number of values when X = x*.

2. E(Y | X = x*), the value of the regression line at X = x*, is entirely different from Y*, a single value of Y when X = x*. In particular, Y* need not lie on the population regression line.
3. A confidence interval is always reported for a parameter (e.g., E(Y | X = x*) = β₀ + β₁x*) and a prediction interval is reported for the value of a random variable (e.g., Y*).

We base our prediction of Y when X = x* (that is, of Y*) on

ŷ* = β̂₀ + β̂₁x*.

The error in our prediction is

Y* − ŷ* = β₀ + β₁x* + e* − ŷ* = (E(Y | X = x*) − ŷ*) + e*,

that is, the deviation between E(Y | X = x*) and ŷ* plus the random fluctuation e* (which represents the deviation of Y* from E(Y | X = x*)). Thus the variability in the error for predicting a single value of Y will exceed the variability for estimating the expected value of Y (because of the random error e*).

It can be shown, under the previously stated assumptions, that

E(Y* − ŷ*) = E(Y − ŷ | X = x*) = 0   (2.15)

Var(Y* − ŷ*) = Var(Y − ŷ | X = x*) = σ²(1 + 1/n + (x* − x̄)²/SXX)   (2.16)

Y* − ŷ* ~ N(0, σ²(1 + 1/n + (x* − x̄)²/SXX))   (2.17)

Standardizing (2.17) and replacing σ by S gives

T = (Y* − ŷ*) / (S√(1 + 1/n + (x* − x̄)²/SXX)) ~ t(n − 2).

A 100(1 − α)% prediction interval for Y*, the value of Y at X = x*, is given by

ŷ* ± t(α/2, n − 2) S √(1 + 1/n + (x* − x̄)²/SXX) = β̂₀ + β̂₁x* ± t(α/2, n − 2) S √(1 + 1/n + (x* − x̄)²/SXX)

where t(α/2, n − 2) is the 100(1 − α/2)th quantile of the t-distribution with n − 2 degrees of freedom.

Regression Output from R

Ninety-five percent confidence intervals for the population regression line (i.e., the average RunTime) at RunSize = 50, 100, 150, 200, 250, 300, 350 are:

       fit      lwr      upr
1 162.7099 148.6204 176.7994
2 175.6720 164.6568 186.6872
3 188.6342 179.9969 197.2714
4 201.5963 193.9600 209.2326
5 214.5585 206.0455 223.0714
6 227.5206 216.7006 238.3407
7 240.4828 226.6220 254.3435

Ninety-five percent prediction intervals for the actual value of Y (i.e., the actual RunTime) at RunSize = 50, 100, 150, 200, 250, 300, 350 are:

       fit      lwr      upr
1 162.7099 125.7720 199.6478
2 175.6720 139.7940 211.5500
3 188.6342 153.4135 223.8548
4 201.5963 166.6076 236.5850
5 214.5585 179.3681 249.7489
6 227.5206 191.7021 263.3392
7 240.4828 203.6315 277.3340

Notice that each prediction interval is considerably wider than the corresponding confidence interval, as is expected.
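Both tables can be reproduced with predict(), again using the fitted model m from earlier; a minimal sketch:

new.data <- data.frame(RunSize = seq(50, 350, by = 50))
predict(m, newdata = new.data, interval = "confidence", level = 0.95)
predict(m, newdata = new.data, interval = "prediction", level = 0.95)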

2.5 Analysis of Variance

There is a linear association between Y and x if Y = β₀ + β₁x + e and β₁ ≠ 0. If we knew that β₁ ≠ 0 then we would predict Y by

ŷ = β̂₀ + β̂₁x.

On the other hand, if we knew that β₁ = 0 then we would predict Y by

ŷ = ȳ.

To test whether there is a linear association between Y and X we have to test

H₀: β₁ = 0 against Hₐ: β₁ ≠ 0.

We can perform this test using the following t-statistic:

T = β̂₁ / se(β̂₁) ~ t(n − 2) when H₀ is true.

We next look at a different test statistic which can be used when there is more than one predictor variable, that is, in multiple regression. First, we introduce some terminology.

Define the total corrected sum of squares of the Y's by

SST = SYY = Σᵢ₌₁ⁿ (yᵢ − ȳ)².

Recall that the residual sum of squares is given by

RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)².

Define the regression sum of squares (i.e., the sum of squares explained by the regression model) by

SSreg = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)².

It is clear that SSreg is close to zero if, for each i, ŷᵢ is close to ȳ, while SSreg is large if ŷᵢ differs from ȳ for most values of x. We next look at the hypothetical situation in Figure 2.4, with just a single data point (xᵢ, yᵢ) shown along with the least squares regression line and the mean of y based on all n data points. It is apparent from Figure 2.4 that

yᵢ − ȳ = (yᵢ − ŷᵢ) + (ŷᵢ − ȳ).

Further, it can be shown that

SST = SSreg + RSS,

that is,

Total sample variability = Variability explained by the model + Unexplained (or error) variability.

See exercise 6 in Section 2.8 for details. If Y = β₀ + β₁x + e and β₁ ≠ 0 then RSS should be "small" and SSreg should be "close" to SST. But how small is "small" and how close is "close"?

Figure 2.4 Graphical depiction that yᵢ − ȳ = (yᵢ − ŷᵢ) + (ŷᵢ − ȳ)

To test H₀: β₁ = 0 against Hₐ: β₁ ≠ 0 we can use the test statistic

F = (SSreg/1) / (RSS/(n − 2))

since RSS has (n − 2) degrees of freedom and SSreg has 1 degree of freedom. Under the assumption that e₁, e₂, ..., eₙ are independent and normally distributed with mean 0 and variance σ², it can be shown that F has an F distribution with 1 and n − 2 degrees of freedom when H₀ is true, that is,

F = (SSreg/1) / (RSS/(n − 2)) ~ F(1, n − 2) when H₀ is true.

Form of test: reject H₀ at level α if F > F(α; 1, n − 2) (which can be obtained from a table of the F distribution). However, all statistical packages report the corresponding p-value.

The usual way of setting out this test is to use an analysis of variance table:

Source of variation   df      SS      MS             F
Regression            1       SSreg   SSreg/1        F = (SSreg/1)/(RSS/(n − 2))
Residual              n − 2   RSS     RSS/(n − 2)
Total                 n − 1   SST

Notes:
1. It can be shown in the case of simple linear regression that

T = β̂₁/se(β̂₁) ~ t(n − 2)  and  F = (SSreg/1)/(RSS/(n − 2)) ~ F(1, n − 2)

are related via F = T².
2. R², the coefficient of determination of the regression line, is defined as the proportion of the total sample variability in the Y's explained by the regression model, that is,

R² = SSreg/SST = 1 − RSS/SST.

The reason this quantity is called R² is that it is equal to the square of the correlation between Y and X. It is arguably one of the most commonly misused statistics.

Regression Output from R

Analysis of Variance Table
Response: RunTime
          Df  Sum Sq Mean Sq F value    Pr(>F)
RunSize    1 12868.4 12868.4  48.717 1.615e-06 ***
Residuals 18  4754.6   264.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the observed F-value of 48.717 is just the square of the observed t-value 6.98, which can be found in the regression output given between Figures 2.2 and 2.3. We shall see in Chapter 5 that analysis of variance overcomes the problems associated with multiple t-tests which occur when there are many predictor variables.
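The ANOVA table above, and the identity F = T², can be checked with a short sketch (using RSS as computed earlier):

anova(m)                                 # reproduces the table above
SST <- sum((y - mean(y))^2)              # total corrected sum of squares
SSreg <- sum((fitted(m) - mean(y))^2)    # regression sum of squares
F.stat <- (SSreg / 1) / (RSS / 18)
F.stat                                   # 48.717
sqrt(F.stat)                             # 6.98, the t-value for the slope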

2.6 Dummy Variable Regression

So far we have only considered situations in which the predictor or X-variable is quantitative (i.e., takes numerical values). We next consider so-called dummy variable regression, which is used in its simplest form when a predictor is categorical with two values (e.g., gender) rather than quantitative. The resulting regression models allow us to test for the difference between the means of two groups. We shall see in a later topic that the concept of a dummy variable can be extended to include problems involving more than two groups.

Using dummy variable regression to compare new and old methods

We shall consider the following example throughout this section. It is taken from Foster, Stine and Waterman (1997, pages 142–148). In this example, we consider a large food processing center that needs to be able to switch from one type of package to another quickly to react to changes in order patterns. Consultants have developed a new method for changing the production line and used it to produce a sample of 48 change-over times (in minutes). Also available is an independent sample of 72 change-over times (in minutes) for the existing method. These two sets of times can be found on the book web site in the file called changeover_times.txt. The first three and the last three rows of the data from this file are reproduced below in Table 2.2. Plots of the data appear in Figure 2.5. We wish to develop an equation to model the relationship between Y, the change-over time, and X, the dummy variable corresponding to New, and hence test whether the mean change-over time is reduced using the new method.

We consider the simple linear regression model

Y = β₀ + β₁x + e

where Y = change-over time and x is the dummy variable (i.e., x = 1 if the time corresponds to the new change-over method and 0 if it corresponds to the existing method).

Regression Output from R

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  17.8611     0.8905  20.058   <2e-16 ***
New          -3.1736     1.4080  -2.254    0.026 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.556 on 118 degrees of freedom
Multiple R-Squared: 0.04128,  Adjusted R-squared: 0.03315
F-statistic: 5.081 on 1 and 118 DF,  p-value: 0.02604

We can test whether there is a significant reduction in the change-over time for the new method by testing the significance of the dummy variable, that is, we wish to test whether the coefficient of x is zero or less than zero:

H₀: β₁ = 0 against Hₐ: β₁ < 0.

We use the one-sided "<" alternative since we are interested in whether the new method has led to a reduction in mean change-over time. The test statistic is

T = β̂₁ / se(β̂₁) ~ t(n − 2) when H₀ is true.

3 Smple Lear Regresso Table. Chage-over tme data (chageover_tmes.txt) Method Y, Chage-over tme X, New Exstg 9 Exstg 4 Exstg 39... New 4 New 4 New 35 Chage Over Tme 35 5 5 5 Chage Over Tme 35 5 5 5...4.6.8. Dummy Varable, New Dummy Varable, New Chage Over Tme 35 5 5 5 Exstg New Method Fgure.5 A scatter plot ad box plots of the chage-over tme data I ths case, T =.54. (Ths result ca be foud the output the colum headed t value ). The assocated p -value s gve by.6 p value = P( T <.54 whe H s true) = =.3 as the two-sded p- value = P( T.54 whe H s true) =.6. Ths meas that there s sgfcat evdece of a reducto the mea chageover tme for the ew method.

Next consider the group consisting of those times associated with the new change-over method. For this group, the dummy variable x is equal to 1. Thus, we can estimate the mean change-over time for the new method as:

17.8611 + (−3.1736) × 1 = 14.6875 ≈ 14.7 minutes.

Next consider the group consisting of those times associated with the existing change-over method. For this group, the dummy variable x is equal to 0. Thus, we can estimate the mean change-over time for the existing method as:

17.8611 + (−3.1736) × 0 = 17.8611 ≈ 17.9 minutes.

The new change-over method produces a reduction in the mean change-over time of 3.2 minutes, from 17.9 to 14.7 minutes. (Notice that the reduction in the mean change-over time for the new method is just the coefficient of the dummy variable.) This reduction is statistically significant.

A 95% confidence interval for the reduction in mean change-over time due to the new method is given by

(β̂₁ − t(α/2, n − 2) se(β̂₁), β̂₁ + t(α/2, n − 2) se(β̂₁))

where t(α/2, n − 2) is the 100(1 − α/2)th quantile of the t-distribution with n − 2 degrees of freedom. In this example the X-variable is the dummy variable New and β̂₁ = −3.1736, se(β̂₁) = 1.4080, t(0.025, n − 2 = 118) = 1.9803. Thus a 95% confidence interval for β₁ (in minutes) is given by

(−3.1736 ± 1.9803 × 1.4080) = (−3.1736 ± 2.7883) = (−5.96, −0.39).

Finally, the company should adopt the new method if a reduction in time of this size is of practical significance.
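A sketch of the dummy variable fit in R. The column names Changeover and New are assumptions about the layout of changeover_times.txt (with New coded 0/1 as described above):

changeover <- read.table("changeover_times.txt", header = TRUE)
m2 <- lm(Changeover ~ New, data = changeover)
summary(m2)            # reproduces the output shown earlier
confint(m2)            # 95% confidence intervals, including the one for New
# One-sided p-value for H_A: beta1 < 0 is half the two-sided value
pt(coef(summary(m2))["New", "t value"], df = 118)   # about 0.013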

2.7 Derivations of Results

In this section, we shall derive some results given earlier about the least squares estimates of the slope and the intercept, as well as results about confidence intervals and prediction intervals. Throughout this section we shall make the following assumptions:
1. Y is related to x by the simple linear regression model Yᵢ = β₀ + β₁xᵢ + eᵢ (i = 1, ..., n), i.e., E(Y | X = xᵢ) = β₀ + β₁xᵢ.
2. The errors e₁, e₂, ..., eₙ are independent of each other.
3. The errors e₁, e₂, ..., eₙ have a common variance σ².
4. The errors are normally distributed with a mean of 0 and variance σ² (especially when the sample size is small), that is, e | X ~ N(0, σ²).

In addition, since the regression model is conditional on X, we can assume that the values of the predictor variable, x₁, x₂, ..., xₙ, are known fixed constants.

2.7.1 Inferences About the Slope of the Regression Line

Recall from (2.5) that the least squares estimate of β₁ is given by

β̂₁ = Σᵢ₌₁ⁿ cᵢyᵢ  where  cᵢ = (xᵢ − x̄)/SXX.

Under the above assumptions we shall derive (2.6), (2.7) and (2.8).

To derive (2.6) consider

E(β̂₁ | X) = E(Σᵢ₌₁ⁿ cᵢyᵢ | X = xᵢ) = Σᵢ₌₁ⁿ cᵢ E(yᵢ | X = xᵢ) = Σᵢ₌₁ⁿ cᵢ(β₀ + β₁xᵢ) = β₀ Σᵢ₌₁ⁿ cᵢ + β₁ Σᵢ₌₁ⁿ cᵢxᵢ = β₀ × 0 + β₁ × 1 = β₁,

since

Σᵢ₌₁ⁿ cᵢ = Σᵢ₌₁ⁿ (xᵢ − x̄)/SXX = 0  and  Σᵢ₌₁ⁿ cᵢxᵢ = Σᵢ₌₁ⁿ (xᵢ − x̄)xᵢ/SXX = SXX/SXX = 1,

using Σᵢ₌₁ⁿ (xᵢ − x̄) = 0 and Σᵢ₌₁ⁿ (xᵢ − x̄)xᵢ = Σᵢ₌₁ⁿ (xᵢ − x̄)² = SXX.

To derive (2.7) consider

Var(β̂₁ | X) = Var(Σᵢ₌₁ⁿ cᵢyᵢ | X = xᵢ) = Σᵢ₌₁ⁿ cᵢ² Var(yᵢ | X = xᵢ) = σ² Σᵢ₌₁ⁿ cᵢ² = σ² Σᵢ₌₁ⁿ ((xᵢ − x̄)/SXX)² = σ² SXX/SXX² = σ²/SXX.

Finally, we derive (2.8). Under assumption (4), the errors eᵢ | X are normally distributed. Since yᵢ = β₀ + β₁xᵢ + eᵢ (i = 1, 2, ..., n), Yᵢ | X is normally distributed. Since β̂₁ | X is a linear combination of the yᵢ's, β̂₁ | X is normally distributed.

2.7.2 Inferences About the Intercept of the Regression Line

Recall from (2.3) that the least squares estimate of β₀ is given by

β̂₀ = ȳ − β̂₁x̄.

Under the assumptions given previously, we shall derive (2.9), (2.10) and (2.11).

To derive (2.9) we shall use the fact that

E(β̂₀ | X) = E(ȳ | X) − E(β̂₁ | X) x̄.

The first piece of the last equation is

E(ȳ | X) = (1/n) Σᵢ₌₁ⁿ E(yᵢ | X = xᵢ) = (1/n) Σᵢ₌₁ⁿ (β₀ + β₁xᵢ) = β₀ + β₁x̄.

The second piece of that equation is E(β̂₁ | X) x̄ = β₁x̄. Thus,

E(β̂₀ | X) = E(ȳ | X) − E(β̂₁ | X) x̄ = β₀ + β₁x̄ − β₁x̄ = β₀.

To derive (2.10) consider

Var(β̂₀ | X) = Var(ȳ − β̂₁x̄ | X) = Var(ȳ | X) + x̄² Var(β̂₁ | X) − 2x̄ Cov(ȳ, β̂₁ | X).

The first term is given by

Var(ȳ | X) = (1/n²) Σᵢ₌₁ⁿ Var(yᵢ | X = xᵢ) = nσ²/n² = σ²/n.

From (2.7),

Var(β̂₁ | X) = σ²/SXX.

Finally,

Cov(ȳ, β̂₁ | X) = Cov((1/n) Σᵢ₌₁ⁿ yᵢ, Σⱼ₌₁ⁿ cⱼyⱼ | X) = (1/n) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ cⱼ Cov(yᵢ, yⱼ | X) = (σ²/n) Σᵢ₌₁ⁿ cᵢ = 0,

since by independence Cov(yᵢ, yⱼ | X) = 0 for i ≠ j and Var(yᵢ | X) = σ². So,

Var(β̂₀ | X) = σ²(1/n + x̄²/SXX).

Result (2.11) follows from the fact that, under assumption (4), Y | X (and hence ȳ) is normally distributed, as is β̂₁ | X.

2.7.3 Confidence Intervals for the Population Regression Line

Recall that the population regression line at X = x* is given by

E(Y | X = x*) = β₀ + β₁x*.

An estimator of the population regression line at X = x* is the value of the estimated regression equation at X = x*, namely,

ŷ* = β̂₀ + β̂₁x*.

Under the assumptions stated previously, we shall derive (2.12), (2.13) and (2.14). First, notice that (2.12) follows from the earlier established results E(β̂₀ | X = x*) = β₀ and E(β̂₁ | X = x*) = β₁. Next, consider (2.13):

Var(ŷ | X = x*) = Var(β̂₀ + β̂₁x* | X = x*) = Var(β̂₀ | X = x*) + x*² Var(β̂₁ | X = x*) + 2x* Cov(β̂₀, β̂₁ | X = x*).

Now,

Cov(β̂₀, β̂₁ | X = x*) = Cov(ȳ − β̂₁x̄, β̂₁ | X = x*) = Cov(ȳ, β̂₁ | X = x*) − x̄ Cov(β̂₁, β̂₁ | X = x*) = 0 − x̄ Var(β̂₁ | X = x*) = −x̄σ²/SXX.

So that,

Var(ŷ | X = x*) = σ²(1/n + x̄²/SXX) + x*²σ²/SXX − 2x*x̄σ²/SXX = σ²(1/n + (x* − x̄)²/SXX).

Result (2.14) follows from the fact that, under assumption (4), β̂₀ | X is normally distributed, as is β̂₁ | X.

2.7.4 Prediction Intervals for the Actual Value of Y

We base our prediction of Y when X = x* (that is, of Y*) on

ŷ* = β̂₀ + β̂₁x*.

The error in our prediction is

Y* − ŷ* = β₀ + β₁x* + e* − ŷ* = (E(Y | X = x*) − ŷ*) + e*,

that is, the deviation between E(Y | X = x*) and ŷ* plus the random fluctuation e* (which represents the deviation of Y* from E(Y | X = x*)).

Under the assumptions stated previously, we shall derive (2.15), (2.16) and (2.17). First, we consider (2.15):

E(Y* − ŷ*) = E(Y − ŷ | X = x*) = E(Y | X = x*) − E(β̂₀ + β̂₁x* | X = x*) = β₀ + β₁x* − (β₀ + β₁x*) = 0.

In considering (2.16), notice that ŷ* is independent of Y*, a future value of Y, so that Cov(Y, ŷ | X = x*) = 0. Thus,

Var(Y* − ŷ*) = Var(Y − ŷ | X = x*) = Var(Y | X = x*) + Var(ŷ | X = x*) − 2 Cov(Y, ŷ | X = x*) = σ² + σ²(1/n + (x* − x̄)²/SXX) = σ²(1 + 1/n + (x* − x̄)²/SXX).

Finally, (2.17) follows since both ŷ* and Y* are normally distributed.
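These distributional results can also be checked empirically. The simulation sketch below estimates the coverage of the 95% prediction interval at x* = 250; the true parameter values (150, 0.25 and σ = 16, roughly matching the production fit) are arbitrary choices for illustration:

set.seed(1)
x.sim <- production$RunSize
covered <- replicate(10000, {
  y.sim <- 150 + 0.25 * x.sim + rnorm(20, sd = 16)   # generate data from the model
  fit <- lm(y.sim ~ x.sim)
  int.pred <- predict(fit, newdata = data.frame(x.sim = 250),
                      interval = "prediction")
  y.new <- 150 + 0.25 * 250 + rnorm(1, sd = 16)      # a new observation at x* = 250
  int.pred[1, "lwr"] <= y.new && y.new <= int.pred[1, "upr"]
})
mean(covered)   # should be close to 0.95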

2.8 Exercises

1. The web site www.playbill.com provides weekly reports on the box office ticket sales for plays on Broadway in New York. We shall consider the data for the week October 10–17, 2004 (referred to below as the current week). The data are in the form of the gross box office results for the current week and the gross box office results for the previous week (i.e., October 3–10, 2004). The data, plotted in Figure 2.6, are available on the book web site in the file playbill.csv.

Fit the following model to the data: Y = β₀ + β₁x + e, where Y is the gross box office results for the current week (in $) and x is the gross box office results for the previous week (in $). Complete the following tasks:
(a) Find a 95% confidence interval for the slope of the regression model, β₁. Is 1 a plausible value for β₁? Give a reason to support your answer.
(b) Test the null hypothesis H₀: β₀ = 10000 against a two-sided alternative. Interpret your result.
(c) Use the fitted regression model to estimate the gross box office results for the current week (in $) for a production with $400,000 in gross box office results the previous week. Find a 95% prediction interval for the gross box office results for the current week (in $) for a production with $400,000 in gross box office results the previous week. Is $450,000 a feasible value for the gross box office results in the current week for a production with $400,000 in gross box office results the previous week? Give a reason to support your answer.
(d) Some promoters of Broadway plays use the prediction rule that next week's gross box office results will be equal to this week's gross box office results. Comment on the appropriateness of this rule.

Figure 2.6 Scatter plot of gross box office results from Broadway (current week against previous week)

2. A story by James R. Hagerty entitled "With Buyers Sidelined, Home Prices Slide", published in the Thursday October 25, 2007 edition of the Wall Street Journal, contained data on so-called fundamental housing indicators in major real estate markets across the US. The author argues that prices are generally falling and overdue loan payments are piling up. Thus, we shall consider data presented in the article on

Y = Percentage change in average price from July 2006 to July 2007 (based on the S&P/Case-Shiller national housing index); and

x = Percentage of mortgage loans 30 days or more overdue in latest quarter (based on data from Equifax and Moody's).

The data are available on the book web site in the file indicators.txt. Fit the following model to the data: Y = β₀ + β₁x + e. Complete the following tasks:
(a) Find a 95% confidence interval for the slope of the regression model, β₁. On the basis of this confidence interval decide whether there is evidence of a significant negative linear association.
(b) Use the fitted regression model to estimate E(Y | X = 4). Find a 95% confidence interval for E(Y | X = 4). Is 0% a feasible value for E(Y | X = 4)? Give a reason to support your answer.

3. The manager of the purchasing department of a large company would like to develop a regression model to predict the average amount of time it takes to process a given number of invoices. Over a 30-day period, data are collected on the number of invoices processed and the total time taken (in hours). The data are available on the book web site in the file invoices.txt. The following model was fit to the data: Y = β₀ + β₁x + e, where Y is the processing time and x is the number of invoices. A plot of the data and the fitted model can be found in Figure 2.7. Utilizing the output from the fit of this model provided below, complete the following tasks:
(a) Find a 95% confidence interval for the start-up time, i.e., β₀.
(b) Suppose that a best practice benchmark for the average processing time for an additional invoice is 0.01 hours (or 0.6 minutes). Test the null hypothesis H₀: β₁ = 0.01 against a two-sided alternative. Interpret your result.
(c) Find a point estimate and a 95% prediction interval for the time taken to process 130 invoices.

Figure 2.7 Scatter plot of the invoice data (processing time, in hours, against number of invoices)

Regression output from R for the invoice data

Call:
lm(formula = Time ~ Invoices)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6417099  0.1222707   5.248 1.41e-05 ***
Invoices    0.0112916  0.0008184  13.797 5.17e-14 ***
---
Residual standard error: 0.3298 on 28 degrees of freedom
Multiple R-Squared: 0.8718,  Adjusted R-squared: 0.8672
F-statistic: 190.4 on 1 and 28 DF,  p-value: 5.175e-14

mean(Time) = 2.1       median(Time) = 2
mean(Invoices) = 130.0 median(Invoices) = 127.5

4. Straight-line regression through the origin: In this question we shall make the following assumptions:
(1) Y is related to x by the simple linear regression model Yᵢ = βxᵢ + eᵢ (i = 1, 2, ..., n), i.e., E(Y | X = xᵢ) = βxᵢ.

(2) The errors e₁, e₂, ..., eₙ are independent of each other.
(3) The errors e₁, e₂, ..., eₙ have a common variance σ².
(4) The errors are normally distributed with a mean of 0 and variance σ² (especially when the sample size is small), i.e., e | X ~ N(0, σ²).

In addition, since the regression model is conditional on X, we can assume that the values of the predictor variable, x₁, x₂, ..., xₙ, are known fixed constants.

(a) Show that the least squares estimate of β is given by

β̂ = Σᵢ₌₁ⁿ xᵢyᵢ / Σᵢ₌₁ⁿ xᵢ².

(b) Under the above assumptions show that
(i) E(β̂ | X) = β
(ii) Var(β̂ | X) = σ² / Σᵢ₌₁ⁿ xᵢ²
(iii) β̂ | X ~ N(β, σ² / Σᵢ₌₁ⁿ xᵢ²)

5. Two alternative straight-line regression models have been proposed for Y. In the first model, Y is a linear function of x₁, while in the second model Y is a linear function of x₂. The plot in the first column of Figure 2.8 is that of Y against x₁, while the plot in the second column is that of Y against x₂. These plots also show the least squares regression lines. In the following statements RSS stands for residual sum of squares, while SSreg stands for regression sum of squares. Which one of the following statements is true?
(a) RSS for model 1 is greater than RSS for model 2, while SSreg for model 1 is greater than SSreg for model 2.
(b) RSS for model 1 is less than RSS for model 2, while SSreg for model 1 is less than SSreg for model 2.
(c) RSS for model 1 is greater than RSS for model 2, while SSreg for model 1 is less than SSreg for model 2.
(d) RSS for model 1 is less than RSS for model 2, while SSreg for model 1 is greater than SSreg for model 2.

Give a detailed reason to support your choice.

Figure 2.8 Scatter plots and least squares lines (Model 1: Y against x₁; Model 2: Y against x₂)

6. In this problem we will show that SST = SSreg + RSS. To do this we will show that

Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)(ŷᵢ − ȳ) = 0.

(a) Show that (yᵢ − ŷᵢ) = (yᵢ − ȳ) − β̂₁(xᵢ − x̄).
(b) Show that (ŷᵢ − ȳ) = β̂₁(xᵢ − x̄).
(c) Utilizing the fact that β̂₁ = SXY/SXX, show that Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)(ŷᵢ − ȳ) = 0.

7. A statistics professor has been involved in a collaborative research project with two entomologists. The statistics part of the project involves fitting regression models to large data sets. Together they have written and submitted a manuscript to an entomology journal. The manuscript contains a number of scatter plots, each showing an estimated regression line (based on a valid model) and

associated individual 95% confidence intervals for the regression function at each x value, as well as the observed data. A referee has asked the following question: "I don't understand how 95% of the observations can fall outside the 95% CI as depicted in the figures." Briefly explain how it is entirely possible that 95% of the observations fall outside the 95% CIs as depicted in the figures.

http://www.springer.com/978-0-387-09607-0