Lecture: Linear Regression with One Predictor Variable

- Basics
- Meaning of regression parameters p11
  - $\beta_1$: the slope of the regression line. It indicates the change in the mean of the probability distribution of Y per unit increase in X.
  - $\beta_0$: the Y intercept of the regression line.
- 1.6 Estimation of regression function p15
- Least squares method p15
- Example: SAS code and output for Q1.28 p37

SAS code for Q1.28 p37

options ls=75 nodate;
data crime;
  input Y X;
  datalines;
8487 74
8179 82
8362 81
8220 81
6246 87
9100 66
363 9
8040 88
698 83
758 76
;
proc plot;
  plot Y*X;
proc reg;
  model Y=X;
run;
[Plot: SAS PROC PLOT output for Q1.28 p37, "Plot of Y*X". Vertical axis: Y. Horizontal axis: X, ticks from 60 to 95. Legend: A = 1 obs, B = 2 obs, etc.]
The SAS System

Model: MODEL1
Dependent Variable: Y

                      Analysis of Variance

                       Sum of           Mean
Source        DF      Squares         Square    F Value    Prob>F
Model          1   93462942.267   93462942.267   16.834    0.0001
Error         82   455273165.29   5552111.7719
C Total       83   548736107.56

Root MSE   2356.29195    R-square   0.1703
Dep Mean   7111.20238    Adj R-sq   0.1602
C.V.         33.13493

                      Parameter Estimates

              Parameter     Standard    T for H0:
Variable  DF   Estimate        Error   Parameter=0   Prob > |T|
INTERCEP   1      20518      3277.64        6.260       0.0001
X          1  -170.575189    41.5743       -4.103       0.0001
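For reference, the least squares method listed in the outline picks $b_0$ and $b_1$ to minimize the sum of squared deviations $Q$. The closed-form solution for the one-predictor model is sketched below; the symbols $Q$, $\bar{X}$, $\bar{Y}$ and $n$ are the usual textbook notation and are assumed here rather than taken from these notes.

```latex
\begin{align*}
Q   &= \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2 \\
b_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \\
b_0 &= \bar{Y} - b_1 \bar{X}
\end{align*}
```

In the PROC REG listing, $b_0$ is the Parameter Estimate on the INTERCEP row and $b_1$ is the Parameter Estimate on the X row.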
- Properties of least squares p18
- Point estimation of mean response p21
- Residuals p22
- Example: In Q1.28 p37, the regression equation is
    $\hat{Y} = 20518 - 170.575189X$
  When X = 74, the predicted value of Y (i.e. $\hat{Y}$) is $20518 - 170.575189 \times 74 = 7895.44$.
  Residual $= Y - \hat{Y} = 8487 - 7895.44 = 591.56$.
  SAS carries more digits of the intercept than the rounded 20518 used here, so it reports PRED = 7895.04 and RESID = 591.96 for this case.

SAS code and output

options ls=75 nodate;
data crime;
  input Y X;
  datalines;
8487 74
8179 82
698 83
758 76
;
proc plot;
  plot Y*X;
proc reg;
  model Y=X;
  output out=a p=pred r=resid;
proc print data=a;
  var Y pred resid;
run;
The SAS System

OBS      Y       PRED       RESID
  1    8487    7895.04      591.96
  2    8179    6530.43     1648.57
  3    8362    6701.01     1660.99
  4    8220    6701.01     1518.99
  5    6246    5677.56      568.44
  6    9100    9259.64     -159.64
  7    6561    8918.49    -2357.49
  8    5873    6701.01     -828.01
  9    7993    7895.04       97.96
 10    7932    6530.43     1401.57

- Properties of the fitted line p23
  1. $\sum_{i=1}^{n} e_i = 0$
  2. $\sum_{i=1}^{n} X_i e_i = 0$
  3. $\sum_{i=1}^{n} \hat{Y}_i e_i = 0$
- Estimation of the error variance
- 1.8 Normal error regression model p26
- MLEs of parameters

Practice Problems 7, 3, 34
Chap 2 - p40
Inference in Regression and Correlation Analysis

2.1 Sampling distribution of $b_1$ p41
  - unbiased
  - linear
  - has minimum variance
  - estimated variance p43
  - sampling distribution of $(b_1 - \beta_1)/SE(b_1)$ p44
  - CIs for $\beta_1$ p45
  - Tests concerning $\beta_1$ p47
2.2 Sampling distribution of $b_0$
  - mean, variance
  - sampling distribution of $(b_0 - \beta_0)/se(b_0)$
  - CIs for $\beta_0$
2.3 Some considerations on making inferences concerning $\beta_1$ and $\beta_0$ p50
  - Effects of departure from normality p50
  - Spacing of X values p50
  - Power of tests p50
2.4 Interval estimation of $E(Y_h)$ p52
  - Sampling distribution of $\hat{Y}_h$ p52
  - Sampling distribution of $(\hat{Y}_h - E(Y_h))/s(\hat{Y}_h) \sim t(n-2)$
  - CIs for $E(Y_h)$ p54
  - Example
2.5 Prediction of new observations p55
  - Example
  - Prediction of m new observations for given $X_h$
2.6 Confidence band for regression line p61
  - Working-Hotelling $1 - \alpha$ confidence band for the regression line:
    $\hat{Y}_h \pm W s\{\hat{Y}_h\}$, where $W^2 = 2F(1-\alpha;\, 2,\, n-2)$
2.7 ANOVA approach to regression analysis p63
  - Partitioning the total SS p63
    SSTO = SSR + SSE
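The partition SSTO = SSR + SSE can be written out term by term, together with the matching split of the degrees of freedom. This is a sketch in standard notation; $\hat{Y}_i$ denotes the fitted value and $\bar{Y}$ the sample mean of Y, which are assumed symbols here.

```latex
\begin{align*}
\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{SSTO}
  &= \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{SSR}
   + \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{SSE} \\
n - 1 &= 1 + (n - 2)
\end{align*}
```

For the Q1.28 output, where C Total has 83 degrees of freedom (n = 84), the split is 83 = 1 + 82.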
  - ANOVA table p67
  - Expected MS p68: $E(MSE) = \sigma^2$ and $E(MSR) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$
  - F test for $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$ p69
  - Example: Q1.28 above
  - Equivalence of the F test and the t test for $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$ p71
2.8 General linear test procedure p72
2.9 Descriptive measures of linear association between X and Y p74
    R-sq = SSR/SSTO = 1 - SSE/SSTO
  - Example: see SAS output for Q1.28 above
  - Limitations of R-sq p75
    Misunderstanding 1: A high R-sq indicates that useful predictions can be made. This is not necessarily correct.
    Example: In the Toluca Company example R-sq = 0.822, but the 90% prediction interval for a new lot consisting of 100 units is wide, (332.2, 506.6), and not precise enough to permit management to schedule workers effectively.

The SAS System

Model: MODEL1
Dependent Variable: Y

                      Analysis of Variance

                       Sum of           Mean
Source        DF      Squares         Square    F Value    Prob>F
Model          1   252377.5808    252377.5808   105.876    0.0001
Error         23    54825.4599      2383.7156
C Total       24   307203.04000

Root MSE     48.82331    R-square   0.8215
Dep Mean    312.28000    Adj R-sq   0.8138
C.V.         15.63447

                      Parameter Estimates

              Parameter      Standard    T for H0:
Variable  DF   Estimate         Error   Parameter=0   Prob > |T|
INTERCEP   1   62.365859   26.17743389        2.382       0.0259
X          1    3.570202    0.34697216       10.290       0.0001
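As a quick check on how the quantities in this part of the outline fit together, the Toluca ANOVA figures can be substituted into the definitions. This is a sketch that assumes the listing reads SSR = 252377.5808, SSE = 54825.4599 and SSTO = 307203.04, with 23 error degrees of freedom and a slope t statistic of 10.290.

```latex
\begin{align*}
R^2     &= \frac{SSR}{SSTO} = \frac{252377.5808}{307203.04} \approx 0.8215 \\
MSE     &= \frac{SSE}{n-2} = \frac{54825.4599}{23} \approx 2383.72 \\
F^*     &= \frac{MSR}{MSE} = \frac{252377.5808}{2383.72} \approx 105.88 \\
(t^*)^2 &= 10.290^2 \approx 105.88 = F^*
\end{align*}
```

The last line illustrates the equivalence of the F test and the two-sided t test for the slope. A large $F^*$ or R-sq shows that the slope is clearly nonzero; it does not by itself make the prediction interval narrow, which is the point of Misunderstanding 1.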
    Misunderstanding 2: A high R-sq indicates that the estimated regression line is a good fit. This is not necessarily correct.
    Misunderstanding 3: An R-sq near zero indicates that X and Y are not related. This is not necessarily correct.
  - Coefficient of correlation
    A measure of linear association between Y and X when both Y and X are random is the coefficient of correlation. This measure is the signed square root of R-sq:
      $r = \pm\sqrt{R^2}$
    A plus or minus sign is attached to this measure according to whether the slope of the regression line is positive or negative.
    Example: For the Toluca Company example above, R-sq = 0.822. Treating X as a random variable, $r = +\sqrt{0.822} = 0.907$ (the slope is positive).
2.10 Considerations in applying regression analysis p77
  - read p77
  - Inferences on correlation coefficients p83
    The coefficient of correlation $\rho_{12}$ between two rvs $Y_1$ and $Y_2$ is
      $\rho_{12} = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2}$, where $\sigma_{12} = E\{(Y_1 - \mu_1)(Y_2 - \mu_2)\}$
  - A point estimator of $\rho_{12}$: the MLE of $\rho_{12}$ is (p83)
      $r_{12} = \dfrac{\sum (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2)}{\left[ \sum (Y_{i1} - \bar{Y}_1)^2 \sum (Y_{i2} - \bar{Y}_2)^2 \right]^{1/2}}$
  - Testing $H_0: \rho_{12} = 0$ vs $H_1: \rho_{12} \neq 0$
    Test statistic: $t^* = \dfrac{r_{12}\sqrt{n-2}}{\sqrt{1 - r_{12}^2}} \sim t(n-2)$ under $H_0$
    Example: In the Toluca Company example above r = 0.907. If we are interested in testing $H_0: \rho_{12} \le 0$ vs $H_1: \rho_{12} > 0$,
      $t^* = \dfrac{0.907\sqrt{25-2}}{\sqrt{1 - 0.907^2}} \approx 10.3 > t(25-2,\ 0.95) = 1.714$, and so reject $H_0: \rho_{12} \le 0$.
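The correlation test statistic is algebraically the same as the t statistic for the slope, which gives a way to cross-check the example. The sketch below assumes $r^2 = R^2 = SSR/SSTO$ and uses the R-square value 0.8215 and n = 25 as read from the Toluca listing.

```latex
\begin{align*}
(t^*)^2 &= \frac{(n-2)\, r^2}{1 - r^2}
         = \frac{SSR/SSTO}{(SSE/SSTO)/(n-2)}
         = \frac{MSR}{MSE} = F^* \\
t^*     &= \sqrt{\frac{23 \times 0.8215}{1 - 0.8215}} \approx 10.29
\end{align*}
```

With the unrounded R-square this agrees with the slope's t value of 10.290 in the Parameter Estimates table; using the rounded r = 0.907 moves the result only in the second decimal place, so the conclusion is unchanged.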