IV. Modeling a Mean: Simple Linear Regression


Maurice Paul Garrison
5 years ago

1 IV. Modeling a Mean: Simple Linear Regression

We have talked about inference for a single mean, for comparing two means, and for comparing several means. What if the mean of one variable depends on the value of another continuous-type variable? In the case of a linear trend (i.e., the value of one outcome tends to increase or decrease linearly with an increase in another variable), we can fit a model that describes that trend. The tool most often used for this kind of analysis is called a linear regression model.

2 Sampling Pairs of Points

In this setting, we have $n$ pairs of points $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. In other words, we are measuring two variables for each subject, sampling from a population as demonstrated in the schematic below. In this case we are often interested in how the variables correlate or covary. That is, considering the X variable as the explanatory or independent variable, and the Y variable as the outcome or dependent variable, the question is: How does the average $E(Y)$ depend on the value of X?

[Schematic: a population with parameters $E(X)$, $\sigma_X^2$, $E(Y)$, $\sigma_Y^2$ is sampled to obtain the pairs $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, which are summarized by the sample statistics $\bar{X}$, $s_X^2$, $\bar{Y}$, $s_Y^2$.]

3 Exploratory Analysis (Stat 5100 Linear Regression and Time Series)

As indicated on the previous slide, an important first step in exploring the distributions of paired observations is to use summary statistics and univariate charts (e.g., histograms, boxplots) to understand their marginal or individual distributions. Since we're interested in how the variables relate to each other, a key subsequent step is to construct a two-dimensional scatterplot, treating Y as a function of X. Here, each point in the plot corresponds to one observation in the sample, and is determined by the corresponding (X, Y) coordinate of that observation.

[Figure: a sample point $(X_i, Y_i)$ plotted against the X and Y axes.]


4 Patterns of Association

A plot helps us quickly identify relationships between variables that can inform how we model the relationship. For example, what do you observe from the following plot of the % of physically active adults in each U.S. state versus average annual temperature? (Source:


5 Linear versus Nonlinear Associations

For now, as we focus on simple regression models, we will look at examples where the relationship between the two variables of interest is linear. However, in applied research settings one might observe a wide variety of patterns. What is the relationship between X and Y in the scatterplot to the right? Later, when we discuss multivariable models, we will see how such nonlinear relationships can be accommodated using a regression approach.

6 Functional versus Statistical Relationships

It is also important to distinguish between a purely functional association between two variables and what we might term a statistical association. A functional relationship is one that is deterministic; i.e., a given value of X yields the exact same value for Y whenever an experiment is repeated. An example of this is the distance Y travelled by an object in free fall over time T, which is given by $Y = \tfrac{1}{2} g T^2$, where $g$ is the acceleration due to gravity at or near sea level. On the other hand, a statistical relationship is one for which a given value of X yields different values of Y for repetitions of the same experiment. That is, Y is random, and its distribution may depend upon X. In the linear regression setting, we assume that $E(Y)$ depends linearly upon X.

7 Example IV.A

Engineers were interested in how salt distribution on roadways affects salt concentration in adjacent waterways. They gathered data at 20 locations, measuring the roadway area at each site along with the salt concentration in the nearby river. The data are shown on the following slide (they are also posted on the course website in the file salt.txt). We would like to know whether or not greater roadway area is associated with higher average salt concentration. Why is it natural to designate the explanatory variable and response variable in this way?

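8 Example IV.A (cont'd)

[Table: observation number (Obs), salt concentration, and roadway area for the 20 sites, laid out in two side-by-side blocks of columns. The data values did not survive extraction.]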

9 Example IV.A (cont'd)

The scatterplot for these data is given below:

[Scatterplot: salt concentration (vertical axis) versus roadway area (horizontal axis).]

10 Example IV.A (cont'd)

What sort of relationship do you observe between the area of a given road and the amount of salt found in nearby waterways? Based on what we observe, we would like to answer a few questions. These might include: What is the observed average increase in salt concentration for incrementally larger roadway area? Is this average increase statistically significant? That is to say, is the observed correlation real, or can it be attributed to chance?

11 Example IV.A (cont'd)

One way of answering these sorts of questions is to fit a model. That is, since we observe a somewhat linear association (average salt concentration appears to increase linearly with increased road area), we fit a line to the data. Knowing that a line is determined by a slope and an intercept, the question is how do we select the best line? A statistical solution to this problem is the so-called least-squares fit, or linear regression of salt concentration on road area. The following slide shows this regression line overlaid on the scatterplot of salt versus area.

12 Example IV.A (cont'd)

[Scatterplot: salt concentration versus roadway area, with the fitted regression line overlaid.]

13 The Model

As noted earlier, in general we sample pairs of points $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, where X is referred to as the explanatory variable and Y is the response variable. Note that this does not imply that X necessarily causes Y, although that is possible. X and Y may simply be associated, without any causative effect. We typically want to explain changes in the average of Y due to a difference in X. If X and Y appear to be linearly associated, where Y on average increases or decreases linearly with an increase in X, then we may posit the linear model

$$Y = \beta_0 + \beta_1 X + \varepsilon,$$

where the intercept $\beta_0$ and slope $\beta_1$ determine the line, and $\varepsilon$ models the variability around the line.

14 The Error Term

The scatterplot in Example IV.A is a typical representation of variables that are linearly associated: the points form a sort of cloud. That is to say, the points do not lie on a straight line, indicating that even though Y tends to increase or decrease linearly with X, a given value of X will not necessarily result in exactly the same value of Y. That is, the relationship is not deterministic. The error term $\varepsilon$ in the model accounts for this variability around the line $\beta_0 + \beta_1 X$. Note that $\beta_0 + \beta_1 X$ is fixed, not random. We further typically assume that $\varepsilon \sim N(0, \sigma^2)$. In other words, given the value X, we have $E(Y) = \beta_0 + \beta_1 X$ and $\mathrm{Var}(Y) = \sigma^2$. Therefore, $Y \sim N(\beta_0 + \beta_1 X, \sigma^2)$.
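The distributional assumption above can be made concrete with a small simulation. The sketch below is illustrative only: the function name `simulate_responses` and all parameter values are hypothetical and are not taken from the course materials.

```python
import random

def simulate_responses(x_values, beta0, beta1, sigma, seed=0):
    """Draw one Y per X from Y = beta0 + beta1*X + eps, with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

# Hypothetical parameter values, chosen only for illustration.
x_values = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
y_values = simulate_responses(x_values, beta0=10.0, beta1=3.0, sigma=2.0)

# Repeated X values give different Y values: only the mean
# beta0 + beta1*X is fixed, which is what makes the relationship
# statistical rather than functional.
for x, y in zip(x_values, y_values):
    print(f"X = {x:.1f}  mean = {10.0 + 3.0 * x:.1f}  Y = {y:.2f}")
```

Setting `sigma=0.0` collapses the simulation to the deterministic line, which is the functional case described earlier.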

15 The Model Parameters

What do the terms in the linear model mean? The intercept $\beta_0$ represents the average of y for x = 0. Although the intercept is mathematically necessary in order to specify the form of the line in the model, it seldom has practical meaning. The slope $\beta_1$ is generally the focus of inference: it represents the change in the average of y for every one-unit increase in x. Since we are interested in how y changes with x, a nonzero slope indicates that y and x are linearly associated. The variance term $\sigma^2$ represents the variability of the data around the line.

16 The Experiment

As always, our object is to infer something about the underlying model parameters by sampling from the population and then analyzing the data. Having posited the regression model, we can think of the sampling in this way:

[Schematic: a population following $Y = \beta_0 + \beta_1 X + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, is sampled to obtain the pairs $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$; we then estimate $\beta_0$, $\beta_1$, and $\sigma$ from the data.]

17 Example IV.B

Suppose that the purity of a chemical solution Y is related to the amount of catalyst X through a linear regression model with $\beta_0 = 123.0$, $\beta_1 = 2.16$, and with an error standard deviation of $\sigma$ = [value missing]. What is the expected value of the purity when the catalyst level is 20? How much does the average purity change when the catalyst amount is increased by 10? What is the probability that the purity is less than 60 when the catalyst level is 25?

18 In practical research settings, we do not know the actual parameter values. As indicated on the schematic two slides previous, we sample from the population with the posited regression model and then estimate the regression parameters from the data.

Example IV.C: Write out a linear model for the experiment described in Example IV.A. Clearly interpret each of the parameters of the model.

19 How do we estimate the model parameters?

That is to say, what is the best line, based on the data? In statistical applications, we choose the line that achieves the minimum squared distance between itself and the collective observed data points. Note that the distance from a given value $Y_i$ to its associated point on the line is given by $Y_i - (\beta_0 + \beta_1 X_i)$. We call this the residual. We compute the estimates of the slope and intercept that minimize the sum of the squared residuals. The resulting estimates are given by

$$b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}.$$
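As a minimal sketch of the slope and intercept formulas above, the function below computes $b_0$ and $b_1$ from the raw sums. The function name `least_squares` and the five-point data set are hypothetical; they are not the salt data.

```python
def least_squares(x, y):
    """Return (b0, b1) using the summary-statistic form of the estimates."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # b1 = (sum X_i Y_i - n*Xbar*Ybar) / (sum X_i^2 - n*Xbar^2)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    # b0 = Ybar - b1*Xbar
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = least_squares(x, y)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
```

The raw-sum form matches what can be computed from tabulated summary statistics; with large values, the centered form $\sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ is usually the numerically safer choice.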

20 Derivation of Parameter Estimates

One way of thinking about how $b_0$ and $b_1$ are derived is to consider direct minimization of the sum of squared residuals:

$$Q = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left[ Y_i - (\beta_0 + \beta_1 X_i) \right]^2.$$

We sometimes refer to Q as the objective function. How can we minimize this function with respect to $\beta_0$ and $\beta_1$? (Some discussion about this is contained in Section 1.6 of the text, although the technical details are not that important.)
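One standard route, sketched here for orientation rather than as the text's own presentation, is to set the two partial derivatives of Q to zero and solve the resulting pair of equations (often called the normal equations):

```latex
\begin{aligned}
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left[ Y_i - (\beta_0 + \beta_1 X_i) \right] = 0
  &&\Longrightarrow\quad \sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i, \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} X_i \left[ Y_i - (\beta_0 + \beta_1 X_i) \right] = 0
  &&\Longrightarrow\quad \sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 .
\end{aligned}
```

Dividing the first equation by $n$ gives $b_0 = \bar{Y} - b_1 \bar{X}$. Substituting this into the second equation and using $\sum X_i = n\bar{X}$ gives $b_1 = \left( \sum X_i Y_i - n \bar{X} \bar{Y} \right) / \left( \sum X_i^2 - n \bar{X}^2 \right)$.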

21 Example IV.D

Using the data given in Example IV.A, fit the model specified in Example IV.C. The necessary summary statistics are $n = 20$ together with the sums $\sum X_i$, $\sum Y_i$, $\sum X_i^2$, and $\sum X_i Y_i$ [numeric values missing]. What does the estimated intercept represent in the model fit? Interpret the estimated slope: what does it say about the observed relationship between road area and average salt concentration?

22 Interpreting the Model Fit

Note that once we have obtained our estimates of the intercept and slope, the fitted value for y is given by

$$\hat{Y} = b_0 + b_1 X.$$

There are two ways of viewing such a fitted value: The fitted value is our predicted Y for the given X. The fitted value is our estimate of the average Y for the given X.

23 Example IV.E

Based on the model fit in Example IV.D, what is the predicted salt concentration when the adjacent road area is 0.75? What is our estimated average salt concentration level when the adjacent road area is 0.75? What is the predicted salt concentration when the road area is 2.0? Why should we be cautious about this last prediction?

24 Estimating $\sigma^2$

The last of the three parameters that we need to estimate is the model variance, which represents the variability of the y's around the regression line. Note that the estimated residuals based upon the model fit are given by

$$e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i), \quad i = 1, \ldots, n.$$

Therefore, our estimate of the model variance $\sigma^2$ is the observed average squared residual, also called the mean square error (MSE):

$$s^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}.$$
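The residual and MSE formulas above can be sketched in a few lines. The helper names `fit_line` and `mean_square_error` and the toy data are hypothetical; `fit_line` uses the centered form of the least-squares estimates, which is algebraically the same as the raw-sum form.

```python
def fit_line(x, y):
    """Least-squares intercept and slope (centered form)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def mean_square_error(x, y):
    """Return the residuals e_i and the MSE s^2 = SSE / (n - 2)."""
    b0, b1 = fit_line(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    return residuals, sse / (len(x) - 2)  # n - 2: two parameters were estimated

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
residuals, mse = mean_square_error(x, y)
print([round(e, 2) for e in residuals], round(mse, 3))
```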

25 Example IV.F

The table below shows both the observed salt concentration and predicted salt concentration (using the fitted line) for the observations in Example IV.A.

[Table: observation number (Obs #), observed salt concentration, and predicted salt concentration, laid out in two side-by-side blocks of columns. The data values did not survive extraction.]

26 Example IV.F (cont'd)

Based on the fitted values, we can see that the residual for the first observation is -2.21, for the second observation the residual is 0.59, and so forth. The average squared residual therefore is given by

$$s^2 = \frac{\sum_{i=1}^{20} e_i^2}{18} = \frac{(-2.21)^2 + (0.59)^2 + \cdots + (1.60)^2}{18}.$$

27 Inference for the Slope $\beta_1$

Remember, the fundamental question in a linear regression analysis is whether the dependent and independent variables are linearly associated. As always, there are two aspects to the analysis: Do we observe a slope that is different from zero? That is, does the average of the outcome variable depend on the value of the explanatory variable? Do the data provide evidence that the slope is significantly different from zero? That is, can we infer from our data that the relationship we observe holds for the underlying population? To address the second issue, we need to know something about the distribution of our estimated slope.

28 Distribution of the Estimated Slope

Not surprisingly, it turns out that the estimated slope $b_1$ is approximately normally distributed (provided that the sample is random and, in most cases, that the sample size is sufficiently large). The mean of the distribution of $b_1$ is $\beta_1$. The estimated standard error is given by

$$\mathrm{s.e.}(b_1) = s\{b_1\} = \frac{s}{\left[ \sum_{i=1}^{n} (X_i - \bar{X})^2 \right]^{1/2}},$$

where $s^2$ is the model MSE (or estimate of model variance $\sigma^2$). Since we need to rely on the estimated standard error (i.e., $\sigma$ is unknown), we use the $t(n-2)$ distribution to obtain a confidence interval and hypothesis test for $\beta_1$.

29 Confidence Interval and Hypothesis Test for $\beta_1$

A $(1-\alpha)100\%$ confidence interval for $\beta_1$ is therefore given by

$$b_1 \pm t(1 - \alpha/2;\, n-2)\, s\{b_1\}.$$

We also would like to test the null hypothesis $H_0: \beta_1 = 0$ versus the alternative hypothesis $H_A: \beta_1 \neq 0$. A test statistic for assessing the evidence against $H_0$ is given by

$$t^* = \frac{b_1 - 0}{s\{b_1\}}.$$

Under $H_0$, this test statistic approximately follows the $t(n-2)$ distribution. The p-value is therefore given by $2P\{t(n-2) > |t^*|\}$. Note that we can conceivably test against any specific value of $\beta_1$, although 0 is generally the value of interest.
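The interval and test statistic above can be sketched with the standard library alone. Everything named here is hypothetical: the function `slope_inference`, the toy data, and the choice to pass the critical value in as an argument. The value 3.182 is the tabulated $t(0.975;\, 3)$ quantile for the five-point toy data; the p-value itself would come from a t table or statistical software and is not computed here.

```python
import math

def slope_inference(x, y, t_crit):
    """Return (b1, s{b1}, t statistic, confidence interval) for the slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))      # square root of the MSE
    se_b1 = s / math.sqrt(sxx)        # s{b1}
    t_stat = (b1 - 0.0) / se_b1       # test of H0: beta1 = 0
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b1, se_b1, t_stat, ci

# Hypothetical toy data; t_crit = t(0.975; n - 2 = 3) from a t table.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b1, se_b1, t_stat, ci = slope_inference(x, y, t_crit=3.182)
print(f"b1 = {b1:.3f}, s(b1) = {se_b1:.4f}, t = {t_stat:.3f}")
print(f"95% CI for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```

For these toy data the interval contains zero, so the data would not provide evidence at the 5% level that the slope differs from zero.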

30 Example IV.G

Give a 95% confidence interval for the slope parameter in the model of Example IV.C, based upon the observed data given in Example IV.A. Interpret this confidence interval. State the null and alternative hypotheses for testing a linear association between road area and average salt concentration. Explain these hypotheses. Carry out a test of the null hypothesis of no linear association. What is the p-value for this test? Is there evidence of a relationship between road area and average salt concentration?

31 Measuring the Strength of Association

Note that the slope is one measure of the linear association between two continuous variables: it tells you how much the average of the outcome variable changes with respect to a one-unit increase in the explanatory variable. However, the estimated slope tells you nothing about the variability of the points about the line. Correlation is a measure of the strength of association between two variables that reflects the degree of variability around the fitted line. It is another popular summary statistic for illustrating the degree to which variables are linearly associated.

32 The slope itself does not always reflect the strength of association

For example, note that in the two plots below we observe two data sets with approximately the same estimated slope. However, the association in the first case looks much stronger, as the cloud of points more tightly clusters about the regression line.

33 The Correlation Coefficient

The correlation $\rho$ is another population parameter that we can estimate from the data. We typically use $r$ to denote our estimate of $\rho$. The so-called correlation coefficient $r$ has several important features: $r$ has a range of -1 to 1. It is an index, and has no units. The closer $r$ is to 1, the stronger the positive linear association ($r = 1$ indicates perfect positive correlation). The closer $r$ is to -1, the stronger the negative linear association ($r = -1$ indicates perfect negative correlation). An $r$ close to zero indicates weak linear association. If $r = 0$, this means no linear association. $r$ measures linear association only. Two variables can be strongly associated in a nonlinear way and nevertheless yield $r$ close to 0.

34 Example IV.H

Plots illustrating various values of r:

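35 Computing r

Our estimated correlation coefficient for two variables X and Y is given by

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sqrt{\left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right) \left( \sum_{i=1}^{n} Y_i^2 - n \bar{Y}^2 \right)}}.$$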

36 Example IV.I

Given the five summary statistics in Example IV.D, and that $\sum_{i=1}^{n} Y_i^2$ = [value missing], what is the correlation coefficient between salt concentration and roadway area?

37 Inference for r

To carry out a test of $H_0: \rho = 0$ versus the alternative hypothesis $H_A: \rho \neq 0$, we can use this statistic:

$$t^* = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}},$$

which approximately follows a $t(n-2)$ distribution. In fact, it turns out that this statistic is algebraically equivalent to the t statistic for testing that the regression slope is equal to zero. The p-value for this test is given by $2P\{t(n-2) > |t^*|\}$.
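A short numerical check can make the equivalence between the correlation test and the slope test concrete. The sketch below uses hypothetical helper names (`correlation`, `correlation_t`, `slope_t`) and hypothetical toy data; it computes both statistics independently and compares them.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

def slope_t(x, y):
    """t statistic for H0: beta1 = 0, computed from the fitted line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return b1 / math.sqrt(mse / sxx)

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = correlation(x, y)
t_from_r = correlation_t(r, len(x))
t_from_slope = slope_t(x, y)
print(f"r = {r:.4f}, t from r = {t_from_r:.4f}, t from slope = {t_from_slope:.4f}")
```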

38 Example IV.J

Carry out a test of the null hypothesis that the salt concentration and road area are not correlated, versus the alternative hypothesis that they are correlated. What is the p-value of this test? Interpret this result in words.

39 Inference for Means and Predictions

In addition to inferences about the slope, we may also want to construct tests and confidence intervals for the regression line itself. We will talk about inference for: (1) the average of Y given a corresponding value of X, and (2) a predicted value of Y given a corresponding value of X.

40 Inference for a Mean

Suppose we want to estimate the average of Y for a given value of X, denoted by $X_h$. Our estimated average is

$$\hat{Y}_h = b_0 + b_1 X_h.$$

This estimate has a standard error given by

$$s\{\hat{Y}_h\} = s \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]^{1/2},$$

where $s^2$ is the regression MSE (our estimate of the error variance $\sigma^2$).

41 Inference for a Mean, continued

If $E\{Y_h\}$ represents the actual mean of Y at the value $X_h$, then the statistic

$$\frac{\hat{Y}_h - E\{Y_h\}}{s\{\hat{Y}_h\}}$$

follows a $t(n-2)$ distribution. A $1-\alpha$ confidence interval for the mean of Y is therefore given by

$$\hat{Y}_h \pm t(1 - \alpha/2;\, n-2)\, s\{\hat{Y}_h\}.$$

42 Example IV.K

For the roadway data, compute and interpret a 95% confidence interval for the average salt concentration when the corresponding roadway area is 1.0 m².

43 Inference for a Prediction

As opposed to estimating a mean, suppose instead that we want to make a prediction for a single additional observation. Again, as with the mean, our estimated predicted value for a given $X_h$ is computed as

$$\hat{Y}_h = b_0 + b_1 X_h.$$

However, in this case the estimated prediction has a standard error given by

$$s\{\mathrm{pred}\} = s \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]^{1/2}.$$

Note the difference between this standard error and the one given for an estimated mean. The extra variability arises since here we are estimating a value for a single observation as opposed to an average over many observations.

44 Inference for a Prediction, continued

If $Y_{h(\mathrm{new})}$ represents a randomly sampled value of Y for a corresponding $X_h$, then the statistic

$$\frac{Y_{h(\mathrm{new})} - \hat{Y}_h}{s\{\mathrm{pred}\}}$$

follows a $t(n-2)$ distribution. A $1-\alpha$ interval for a predicted $Y_{h(\mathrm{new})}$ (a prediction interval) is therefore given by

$$\hat{Y}_h \pm t(1 - \alpha/2;\, n-2)\, s\{\mathrm{pred}\}.$$
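The two standard errors (for an estimated mean and for a single prediction) differ only by the extra 1 inside the bracket, which a small sketch makes visible. The function name `interval_standard_errors`, the toy data, and the evaluation point are hypothetical.

```python
import math

def interval_standard_errors(x, y, x_h):
    """Return (Y-hat at x_h, s{Y-hat_h}, s{pred}) from a simple linear fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Shared term: 1/n + (X_h - Xbar)^2 / sum (X_i - Xbar)^2
    shared = 1.0 / n + (x_h - x_bar) ** 2 / sxx
    se_mean = math.sqrt(mse * shared)          # standard error of the estimated mean
    se_pred = math.sqrt(mse * (1.0 + shared))  # standard error of a new observation
    return b0 + b1 * x_h, se_mean, se_pred

# Hypothetical toy data and evaluation point, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat, se_mean, se_pred = interval_standard_errors(x, y, x_h=4.0)
print(f"Y-hat = {y_hat:.2f}, s(mean) = {se_mean:.4f}, s(pred) = {se_pred:.4f}")
```

Multiplying either standard error by the appropriate $t(1 - \alpha/2;\, n-2)$ quantile gives the half-width of the corresponding interval. Both widen as $X_h$ moves away from $\bar{X}$, and the prediction interval is always the wider of the two.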

45 Example IV.L

For the roadway data, compute and interpret a 95% confidence interval for the predicted salt concentration when the corresponding roadway area is 1.0 m².

46 Confidence and Prediction Bands

Researchers often find it useful to construct a confidence interval for the regression line over the entire range of X-values. We can accomplish this by computing the confidence intervals presented on the previous slides, either for the means or the predictions (depending on the investigative focus), at each value of X. In practice this is generally done with computer software.

47 Example IV.M

The plot on the following slide illustrates confidence and prediction bands for the roadway data. Note the relative widths of the intervals delimited by both sets of bounds. How do you explain the wider intervals for the prediction bands?

49 Analysis of Variance (ANOVA) for Regression

Important information about a regression analysis is generally displayed in an ANOVA table. The underlying principle is that the variation of the Y (or outcome) variable arises from two sources:

Total Variation in Y = Variation due to Regression + Unexplained (Residual) Variation

50 Sources of Variation

In more mathematical terms, this relationship can be expressed as:

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,$$

where:

$$SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$

51 ANOVA F Test for Regression Coefficients

It turns out that the ANOVA approach provides us with a useful way of testing coefficients and comparing models in a variety of settings (particularly for multiple regression with several variables). For the simple linear regression model, the ANOVA F statistic for testing $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$ is given by

$$F = \frac{MSR}{MSE},$$

where MSR is the mean square due to regression, or MSR = SSR/df(Regression); and MSE is the mean squared error $s^2$, or MSE = SSE/df(Error). There are generally $n-1$ df associated with SSTO. As we've discussed previously, in the simple model there are $n-2$ df for SSE, leaving 1 df for SSR.

52 ANOVA F Test for $\beta_1$ in the Simple Model

For the simple regression model, relatively large values of F provide evidence against the null $H_0: \beta_1 = 0$, and values of F close to 1.0 indicate little or no evidence against the null. The p-value for this F test is determined by computing the upper-tail probability for the observed statistic with respect to the $F(1, n-2)$ distribution. Note that in the simple case, it turns out that the ANOVA F test and the t test for the slope (discussed earlier) are identical. That is,

$$F = \frac{MSR}{MSE} = \left( \frac{b_1}{s\{b_1\}} \right)^2 = (t^*)^2.$$
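The decomposition SSTO = SSR + SSE and the identity between F and the squared t statistic can both be checked numerically. This is a minimal sketch with a hypothetical function name (`anova_simple`) and hypothetical toy data.

```python
def anova_simple(x, y):
    """Return the ANOVA quantities for a simple linear regression fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    msr = ssr / 1            # 1 df for regression
    mse = sse / (n - 2)      # n - 2 df for error
    t_stat = b1 / (mse / sxx) ** 0.5
    return {"SSTO": ssto, "SSR": ssr, "SSE": sse,
            "MSR": msr, "MSE": mse, "F": msr / mse, "t": t_stat}

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
table = anova_simple(x, y)
for name, value in table.items():
    print(f"{name:>4} = {value:.4f}")
```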

53 ANOVA Table

All of this is summarized in a table, typically in this familiar form:

Source       Degrees of Freedom   Sum of Squares   Mean Squares   F-statistic   p-value
Regression   1                    SSR              MSR            MSR/MSE
Error        n - 2                SSE              MSE
Total        n - 1                SSTO

54 Example IV.N

The ANOVA table for the roadway data is partially completed below. Can you fill in the missing information?

Source       Degrees of Freedom   Sum of Squares   Mean Squares   F-statistic   p-value
Regression
Error        18
Total

55 Example IV.N (cont'd)

Based on the results of the ANOVA procedure, what are your conclusions regarding the association between roadway area and salt concentration?

56 Checking Model Assumptions

What are some of the underlying assumptions we have discussed with respect to the simple regression model?

57 Residuals and Standardized Residuals

Examining the observed residuals can provide key diagnostic information about whether model assumptions are violated. Recall that the residual $e_i$ for the $i$th subject is given by

$$e_i = Y_i - \hat{Y}_i, \quad i = 1, \ldots, n.$$

Since the variance of the error terms is $\sigma^2$, the estimated variance is given by the MSE. It turns out that computing the actual standard deviation of the residuals is a little more complex than simply taking $(MSE)^{1/2}$, but this estimate is not too far off. We therefore define what is referred to as the semistandardized or semistudentized residual as

$$e_i^* = \frac{e_i}{\sqrt{MSE}}.$$
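As a sketch of the semistudentized residual defined above, the function below divides each residual by the square root of the MSE. The name `semistudentized_residuals` and the toy data are hypothetical.

```python
import math

def semistudentized_residuals(x, y):
    """Return e_i* = e_i / sqrt(MSE) for a simple linear regression fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)
    return [e / math.sqrt(mse) for e in residuals]

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
e_star = semistudentized_residuals(x, y)
print([round(v, 3) for v in e_star])
```

Because the values are on a roughly standard-normal scale, observations with a large absolute value of $e_i^*$ stand out as candidates for closer inspection.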

58 Exploration of Residuals

A regression analysis is generally accompanied by an examination of the residuals or standardized residuals, to assess the linearity of the relationship between X and Y, the normality of the residuals, the constancy of the residual variance across the range of X, the independence of the residuals, effects of potential outliers (both with respect to X and Y), and whether any additional explanatory factors may have been omitted.

59 Diagnostics for Linearity

A scatterplot is one of the best ways to assess the nature of the X-Y relationship, but a plot of the residuals (versus either the predictor variable X or the fitted values) can also reveal patterns that could indicate nonlinearity. Note the nonlinear pattern in the plots below:

60 Example IV.O

Is there anything in the residual plot (below) for the roadway data to indicate nonlinearity?

61 Evaluating Non-constant Variance

Residual plots can also be very useful in assessing whether the variance remains constant across the range of X. The plots below illustrate a classic pattern where this assumption is not met:

In examining the residual plot in Example IV.O, is there any evidence of nonconstant variance for the roadway data?

62 Evaluating Dependence Between Residuals

This can sometimes be tricky, but dependence most often manifests itself with respect to the sequence, or temporal ordering, of the measurements. Where an investigator knows the order in which observations were sampled, he or she ought to plot the residuals versus sampling sequence to ensure there is no systematic correlation between contiguous observations. Note that information about sampling order may not always be available.

63 Example IV.P

Consider the data plotted below. How do the plots look in terms of linearity and variance?

64 Example IV.P (continued)

The plot below shows the relationship between the residuals and the order of measurements for the data plotted on the previous slide. What do you observe?

65 Outliers

In addition to initial univariate exploratory analyses, residual plots can be useful for identifying outliers. Note that outliers with respect to the distribution of Y or X in a regression setting can potentially influence the model fit in dramatically different ways. In some cases, outliers may not have any appreciable effect on the analysis. Simply identifying outliers is no reason to throw them out; such observations must be examined individually to (hopefully) explain why they have relatively extreme values. An outlier may exist because of miscoding, incorrect sampling, or even just sheer randomness.

66 Example IV.Q

Note the outlier below with respect to the distribution of the Y variable. What effect (if any) does this observation have on the model fit?

[Scatterplot with fitted line; one point is labeled OUTLIER.]

67 Example IV.Q (continued)

A plot below of the residuals for the data on the previous slide clearly identifies the outlier. Interestingly, the observation appears to be exerting very little influence on the model fit.

[Residual plot; one point is labeled OUTLIER.]

68 Example IV.R

In the plot below, the outlier is extreme in particular with respect to the distribution of the X variable. These kinds of outliers can be particularly problematic in terms of their influence on model fit.

[Scatterplot with fitted line; one point is labeled OUTLIER.]

69 Normality of Residuals

A conventional univariate analysis (i.e., with summary statistics, boxplots, etc.) can be useful in examining the distribution of residuals. The so-called normal probability plot (also known as a normal quantile or Q-Q plot) is also useful for assessing the normality of residuals. A Q-Q plot for a given sample is constructed by plotting the empirical standardized quantiles for the data against the quantiles that would be expected if the data arose from a normal distribution.
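The construction just described can be sketched with the standard library. Two things here are assumptions rather than anything stated on the slides: the function name `qq_points`, and the plotting position $(k - 0.375)/(n + 0.25)$ used to pick the expected normal quantiles (one common convention among several). The residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (expected normal quantile, standardized ordered value) pairs."""
    n = len(values)
    m, s = mean(values), stdev(values)
    ordered = sorted((v - m) / s for v in values)  # empirical standardized quantiles
    std_normal = NormalDist()                      # mean 0, standard deviation 1
    # Assumed plotting position: (k - 0.375) / (n + 0.25), k = 1..n
    expected = [std_normal.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]
    return list(zip(expected, ordered))

# Hypothetical residuals, for illustration only.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
points = qq_points(residuals)
for expected_q, observed_q in points:
    print(f"expected {expected_q:6.3f}   observed {observed_q:6.3f}")
```

Plotting the second coordinate against the first gives the Q-Q plot; points falling near a straight line are consistent with approximate normality.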

70 Example IV.S

A Q-Q plot for the residuals from the roadway data model is shown on the following slide. Note that if the plotted data are at least approximately normally distributed, then the points should roughly follow a straight line. What is your interpretation of this plot?

71 Example IV.S (continued)

72 Example IV.T

The following two slides illustrate examples of Q-Q plots for non-normal data. What is the nature of the deviation from normality in each case?

73 Example IV.T (continued)

74 Example IV.T (continued)

75 Variable Transformations

Problems with nonlinearity, non-constant variance, or nonnormality can frequently be fixed with a simple transformation. Logarithmic and power transformations are the most widely applied. The following example illustrates the utility of this approach.

76 Example IV.U

The data for this example come from a study of water use and household income in Concord, NH, during the summer of 1981 (the dataset is posted as concord.txt on the course website). The following three slides contain a scatterplot with fitted regression line along with two residual plots. What potential problems, if any, do you observe with respect to model assumptions?

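77 Example IV.U (continued)

[Scatterplot with fitted line of the form $\hat{Y} = b_0 + b_1 X$; the fitted coefficients did not survive extraction.]

78 Example IV.U (continued)

79 Example IV.U (continued)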

80 Example IV.U (continued)

In this case, because of the positive skew of the water use distribution, as well as the increasing variance of the residuals, it would be useful to explore a log transformation or a transformation using a power < 1. The following six slides illustrate alternative fits for these data, first using a log transformation, and second with a transformation using a power of 0.3 for water use. What are your conclusions? How do you interpret the fitted coefficients in each case?
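The two alternative fits amount to transforming the response and refitting the same straight-line model. The sketch below assumes hypothetical helper names (`fit_line`, `fit_transformed`) and made-up income and water-use values; it does not use the Concord data.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def fit_transformed(x, y, transform):
    """Fit transform(Y) = b0 + b1*X by least squares."""
    return fit_line(x, [transform(yi) for yi in y])

# Made-up values for illustration only (not the Concord data).
income = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
water_use = [900.0, 1400.0, 1500.0, 2600.0, 2900.0, 5200.0]

log_b0, log_b1 = fit_transformed(income, water_use, math.log)
pow_b0, pow_b1 = fit_transformed(income, water_use, lambda v: v ** 0.3)
print(f"log(Y) fit:  b0 = {log_b0:.3f}, b1 = {log_b1:.4f}")
print(f"Y^0.3 fit:   b0 = {pow_b0:.3f}, b1 = {pow_b1:.4f}")
```

On the interpretation question: in the log fit, a one-unit increase in X adds $b_1$ to the fitted $\log Y$, which multiplies the back-transformed fitted value by $e^{b_1}$. In the power fit, the slope describes change on the $Y^{0.3}$ scale and has no equally simple reading on the original scale.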

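81 Example IV.U (continued)

[Scatterplot with fitted line of the form $\log(\hat{Y}) = b_0 + b_1 X$; the fitted coefficients did not survive extraction.]

82 Example IV.U (continued)

83 Example IV.U (continued)

84 Example IV.U (continued)

[Scatterplot with fitted line for the power-transformed response; the fitted equation did not survive extraction.]

85 Example IV.U (continued)

86 Example IV.U (continued)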

87 Additional Notes on Diagnostics

Assessing the normality of residuals can be a bit tricky under certain circumstances. For example, residuals may actually be normally distributed, but plots (such as boxplots or Q-Q plots) can appear nonnormal because of (i) randomness (especially with a small sample size), or (ii) the exclusion of one or more additional key variables. It is usually a good idea to check other assumptions first, such as linearity and constant variance, before checking normality. Even where the outcome variable isn't exactly normally distributed, substantive conclusions based on a regression model fit may still be fundamentally correct given a relatively large sample size. This is in some sense due to the fact that we are estimating an average, meaning that the Central Limit Theorem applies to the distribution of the fitted mean. We have not illustrated this here with an example, but to check the possibility that other variables are additionally associated with Y, we generally begin simply by constructing additional scatterplots.
