STAT 4385 Topic 03: Simple Linear Regression Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2017
Outline The Set-Up Exploratory Data Analysis (EDA) Scatterplot Coefficient of Correlation Model Specification Model Estimation LSE of Betas Estimation of Error Variance Statistical Inference
Set-Up
The Set-Up
The data $\{(x_i, y_i) : i = 1, \ldots, n\}$ consist of n IID copies of (X, Y), where both the response Y and the predictor X are continuous. We want to study the association/relationship between Y and X.
Examples: revenue vs. advertising expenditure; population over years; daily rainfall vs. barometric pressure at a location; college GPA vs. high school GPA.
Data Layout
ID | Y | X
1 | y_1 | x_1
2 | y_2 | x_2
... | ... | ...
n | y_n | x_n
Set-Up
A Real Example
A study is conducted to investigate the relationship between cigarette smoking during pregnancy and the weights of newborn infants. A sample of 15 women smokers kept accurate records of the number of cigarettes smoked per day (X) during their pregnancies, and the weights (Y) of their children were recorded at birth. The data are given in the table.
ID | Cigarettes Per Day (X) | Birth Weight (Y)
1 | 12 | 7.7
2 | 15 | 8.1
3 | 35 | 6.9
4 | 21 | 8.2
5 | 20 | 8.6
6 | 17 | 8.3
7 | 19 | 9.4
8 | 46 | 7.8
9 | 20 | 8.3
10 | 25 | 5.2
11 | 39 | 6.4
12 | 25 | 7.9
13 | 30 | 8.0
14 | 27 | 6.1
15 | 29 | 8.6
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) summarizes and describes the data and helps us see what the data can tell us before, and beyond, the formal modeling or hypothesis-testing task. Two approaches: graphical or numerical.
For data in SLR, EDA aims to explore the bivariate association between X and Y:
Graphical displays: scatterplot
Numerical measures: correlation coefficient
Scatterplot
A scatterplot (also called scatter chart, scattergram, or scatter diagram) displays the values of (typically) two variables for a set of data. It can be extended to 3D, and color-coding the points allows an additional categorical variable to be displayed.
Inspect a scatterplot for patterns and outliers:
Linear or nonlinear pattern?
Positive or negative (monotonic) association between X and Y?
Any potential outliers?
Scatterplot
[Figure: example scatterplots]
Birthweight Example: Scatterplot
[Figure: scatterplot of birth weight vs. number of daily cigarettes]
Pearson Correlation Coefficient
The Pearson product-moment coefficient of correlation measures the direction and strength of the linear association between two variables.
The population version:
$\rho(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$
Point estimation with data: the sample version
$r(X, Y) = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} = \dfrac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left\{\sum_{i=1}^n x_i^2 - n\bar{x}^2\right\}\left\{\sum_{i=1}^n y_i^2 - n\bar{y}^2\right\}}}$
Facts on ρ and r
r is unitless, with $-1 \le r \le 1$.
Direction of linear association: a positive r indicates a positive linear association (Y tends to increase as X increases), while r < 0 indicates a negative linear association.
The absolute value |r| measures the strength of the linear association. A rule of thumb:
|r| = 1: perfect linear association
|r| = 0.80: strong linear association
|r| = 0.50: moderate linear association
|r| = 0.20: weak linear association
r = 0: no linear association
Exploratory Data Analysis (EDA) Coefficient of Correlation Linear or Nonlinear Association Pearson correlation does not provide info on nonlinear association. Moreover, association does NOT imply causation.
Calculation of r
Preliminary calculation of six quantities: $\{n, \sum_i x_i, \sum_i y_i, \sum_i x_i^2, \sum_i y_i^2, \sum_i x_i y_i\}$.
Obtain $\bar{x} = \sum_i x_i / n$ and $\bar{y} = \sum_i y_i / n$.
Next compute
$SS_{xx} = \sum_i x_i^2 - n\bar{x}^2$
$SS_{yy} = \sum_i y_i^2 - n\bar{y}^2$
$SS_{xy} = \sum_i x_i y_i - n\bar{x}\bar{y}$
Finally compute $r = \dfrac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}}$.
Worksheet: Computing r
ID | Cigarettes Per Day (x_i) | Birth Weight (y_i) | x_i^2 | y_i^2 | x_i y_i
1 | 12 | 7.7 | 144 | 59.29 | 92.4
2 | 15 | 8.1 | 225 | 65.61 | 121.5
3 | 35 | 6.9 | 1225 | 47.61 | 241.5
4 | 21 | 8.2 | 441 | 67.24 | 172.2
5 | 20 | 8.6 | 400 | 73.96 | 172.0
6 | 17 | 8.3 | 289 | 68.89 | 141.1
7 | 19 | 9.4 | 361 | 88.36 | 178.6
8 | 46 | 7.8 | 2116 | 60.84 | 358.8
9 | 20 | 8.3 | 400 | 68.89 | 166.0
10 | 25 | 5.2 | 625 | 27.04 | 130.0
11 | 39 | 6.4 | 1521 | 40.96 | 249.6
12 | 25 | 7.9 | 625 | 62.41 | 197.5
13 | 30 | 8.0 | 900 | 64.00 | 240.0
14 | 27 | 6.1 | 729 | 37.21 | 164.7
15 | 29 | 8.6 | 841 | 73.96 | 249.4
sum | 380 | 115.5 | 10,842 | 906.27 | 2875.3
Example: Computing r
We have found that $n = 15$, $\sum x_i = 380$, $\sum y_i = 115.5$, $\sum x_i^2 = 10842$, $\sum y_i^2 = 906.27$, and $\sum x_i y_i = 2875.3$. Hence
$\bar{x} = 380/15 = 25.33$ and $\bar{y} = 115.5/15 = 7.7$
$SS_{xx} = \sum x_i^2 - n\bar{x}^2 = 10842 - 15 \times 25.33^2 = 1215.33$
$SS_{yy} = \sum y_i^2 - n\bar{y}^2 = 906.27 - 15 \times 7.7^2 = 16.92$
$SS_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = 2875.3 - 15 \times 25.33 \times 7.7 = -50.7$
Thus
$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}} = \dfrac{-50.7}{\sqrt{1215.33 \times 16.92}} = -0.3536$,
which shows a somewhat moderate negative linear association.
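The hand computation above is easy to check with a short script. The slides use R (cor.test()); the sketch below redoes the same arithmetic in plain Python, with variable names of our own choosing:

```python
import math

# Birth-weight data from the worksheet
x = [12, 15, 35, 21, 20, 17, 19, 46, 20, 25, 39, 25, 30, 27, 29]
y = [7.7, 8.1, 6.9, 8.2, 8.6, 8.3, 9.4, 7.8, 8.3, 5.2, 6.4, 7.9, 8.0, 6.1, 8.6]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
# Corrected sums of squares and cross-products
ss_xx = sum(xi**2 for xi in x) - n * xbar**2
ss_yy = sum(yi**2 for yi in y) - n * ybar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
r = ss_xy / math.sqrt(ss_xx * ss_yy)   # r ≈ -0.3536
```

This reproduces the slide's value r = −0.3536 (the tiny discrepancy from the hand calculation comes from rounding x̄ to 25.33).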
Inference on ρ: Case I
Test for zero correlation, i.e., $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$.
It is preferable to use the equivalent test of zero slope in a simple linear regression model.
Assuming (X, Y) follows a bivariate normal distribution with ρ = 0, the fact that
$\dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{(n-2)}$
leads to a t test.
Inference on ρ: Case II
Test of a non-zero correlation $H_0: \rho = \rho_0$.
Assuming (X, Y) follows a bivariate normal distribution, Fisher's (monotonic) z-transformation converts r into an approximately normally distributed variable with constant variance $1/(n-3)$:
$r' = \operatorname{arctanh}(r) = \dfrac{1}{2}\ln\dfrac{1+r}{1-r} \;\dot\sim\; N\!\left(\dfrac{1}{2}\ln\dfrac{1+\rho}{1-\rho},\; \dfrac{1}{n-3}\right),$
where $SE(r') = 1/\sqrt{n-3}$. Here arctanh is the inverse hyperbolic tangent function.
Implemented by the R function cor.test().
The above result can also be used to compare two correlations, $H_0: \rho_1 = \rho_2$, based on two independent data sets.
Fisher's z Transform: Simulation Study
Fisher's z transform helps symmetrize the distribution of r. Each data set of size n = 30 was generated from a bivariate normal distribution with true ρ = 0.7, with 100,000 simulation runs.
[Figure: (a) histogram of r; (b) histogram of arctanh(r)]
Example: Hypothesis Testing on ρ
Consider the smoking vs. infant birth weight example. We want to test $H_0: \rho = -0.5$ vs. $H_a: \rho \ne -0.5$. This is equivalent to testing
$H_0: \dfrac{1}{2}\ln\dfrac{1+\rho}{1-\rho} = \dfrac{1}{2}\ln\dfrac{1+(-0.5)}{1-(-0.5)} = -0.5493.$
The test statistic is
$z_{obs} = \dfrac{\frac{1}{2}\ln\frac{1+r}{1-r} - \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0}}{1/\sqrt{n-3}} = \sqrt{15-3}\left\{\dfrac{1}{2}\ln\dfrac{1+(-0.3536)}{1-(-0.3536)} - (-0.5493)\right\} = 0.623.$
RR: reject $H_0$ if $|z_{obs}| \ge z_{0.975} = 1.96$ at significance level α = 0.05.
Conclusion: we cannot reject $H_0$ since $|z_{obs}| = 0.623 < 1.96$.
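As a sketch (not the slides' own code), the z statistic above can be reproduced with Python's built-in inverse hyperbolic tangent:

```python
import math

r, rho0, n = -0.3536, -0.5, 15
# Fisher z statistic: (arctanh(r) - arctanh(rho0)) * sqrt(n - 3)
z_obs = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
# z_obs ≈ 0.623, well inside (-1.96, 1.96), so H0 is not rejected
```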
Confidence Interval for ρ
Based on Fisher's z transform, a $(1-\alpha) \times 100\%$ confidence interval for ρ can be constructed in two steps:
First construct a $(1-\alpha) \times 100\%$ confidence interval for $\rho' = \frac{1}{2}\ln\frac{1+\rho}{1-\rho}$. Denote it as $(L', U')$, i.e.,
$(L', U') := r' \pm z_{1-\alpha/2}/\sqrt{n-3}.$
Then transform back to a $(1-\alpha) \times 100\%$ confidence interval for ρ:
$(L, U) := \left(\dfrac{\exp(2L') - 1}{\exp(2L') + 1},\; \dfrac{\exp(2U') - 1}{\exp(2U') + 1}\right),$
where the hyperbolic tangent function
$r = \dfrac{\exp(2r') - 1}{\exp(2r') + 1} = \tanh(r')$
is the inverse of Fisher's z transform.
Example: CI for ρ
First, a 95% CI for ρ' in the infant birth weight vs. smoking example is
$\dfrac{1}{2}\ln\dfrac{1+(-0.3536)}{1-(-0.3536)} \pm 1.96/\sqrt{15-3} = (-0.9353, 0.1962).$
Transform to a 95% CI for ρ:
$\left[\dfrac{\exp\{2 \times (-0.9353)\} - 1}{\exp\{2 \times (-0.9353)\} + 1},\; \dfrac{\exp\{2 \times 0.1962\} - 1}{\exp\{2 \times 0.1962\} + 1}\right] = (-0.733, 0.194).$
With 95% confidence, we conclude that the true correlation between the number of cigarettes smoked and infant birth weight is between −0.733 and 0.194.
R code: cor.test(x, y, alternative = "two.sided", method = "pearson", conf.level = .95)
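A minimal Python sketch of the two-step construction (interval on the arctanh scale, then map back with tanh); in practice the R call cor.test() shown above does this in one step:

```python
import math

r, n, z = -0.3536, 15, 1.96   # sample r, sample size, z_{0.975}
half_width = z / math.sqrt(n - 3)
# Step 1: CI on the Fisher-z (arctanh) scale
lo_z = math.atanh(r) - half_width
hi_z = math.atanh(r) + half_width
# Step 2: back-transform with tanh, the inverse of arctanh
lo, hi = math.tanh(lo_z), math.tanh(hi_z)
# (lo, hi) ≈ (-0.733, 0.194)
```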
Model Specification
SLR: Model Specification
Mathematical modeling of relationships among variables: deterministic vs. probabilistic.
Simple linear model (first-order):
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \overset{IID}{\sim} N(0, \sigma^2)$, for $i = 1, \ldots, n$,
where
$E(y_i \mid x_i) = \beta_0 + \beta_1 x_i$ is the deterministic component;
ε is the random error component;
$\{\beta_0, \beta_1\}$ are the regression coefficients;
$\sigma^2$ is the error variance.
Model Assumptions
Four assumptions are involved in the SLR model:
(Linearity) The functional relationship between the (conditional) mean response and the predictor is linear, i.e., $\mu_i \equiv E(y_i \mid x_i) = \beta_0 + \beta_1 x_i$;
(Independence) the $\varepsilon_i$'s are independent of each other;
(Homoscedasticity) the $\varepsilon_i$'s have equal variance $\sigma^2$;
(Normality) the $\varepsilon_i$'s are normally distributed.
In short, $\varepsilon_i \overset{IID}{\sim} N(0, \sigma^2)$. It follows that $y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.
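The four assumptions can be made concrete by simulating from the model. In the sketch below, the β and σ values are hypothetical, chosen loosely to resemble the birth-weight fit later in the slides:

```python
import random

random.seed(2017)
beta0, beta1, sigma = 8.76, -0.042, 1.07   # hypothetical parameter values
n = 100
x = [random.uniform(10, 46) for _ in range(n)]
# y_i = beta0 + beta1 * x_i + eps_i, with eps_i IID N(0, sigma^2):
# independent, equal-variance, normal errors around a linear mean
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
```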
Model Specification Illustration of Statistical Assumptions
Model Interpretation
$\beta_0$ is the y-intercept, which is the mean response E(Y) at X = 0. When $\beta_0 = 0$, the regression line passes through the origin (0, 0).
$\beta_1$ is the slope of the regression line, which corresponds to the change in the mean response $E(Y \mid X)$ for every one-unit increase in X. If $\beta_1 > 0$, positive association; if $\beta_1 < 0$, negative association.
What is the change in the mean response (or expected change in Y) for an a-unit increase in X? (Answer: $a\beta_1$.)
Model Estimation
There are infinitely many choices of $\{\beta_0, \beta_1\}$, each uniquely defining a line. We want to identify the best one. One criterion is the overall distance between the observed $y_i$'s and their predicted values $\hat{y}_i = \beta_0 + \beta_1 x_i$, as measured by the squared differences:
$Q(\beta_0, \beta_1) = \sum_{i=1}^n \{y_i - (\beta_0 + \beta_1 x_i)\}^2.$
The least squares line is given by $(\hat\beta_0, \hat\beta_1)$ such that
$Q(\hat\beta_0, \hat\beta_1) = \min_{\beta_0, \beta_1} Q(\beta_0, \beta_1).$
Least Squares Estimator
$\{\hat\beta_0, \hat\beta_1\}$ is called the least squares estimator (LSE) of $\{\beta_0, \beta_1\}$. The LSE can be uniquely and explicitly determined by solving the first-order necessary conditions $\partial Q/\partial \beta_0 = 0$ and $\partial Q/\partial \beta_1 = 0$:
$\hat\beta_1 = \dfrac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} = \dfrac{SS_{xy}}{SS_{xx}}$
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$
The resulting least squares line is $y = \hat\beta_0 + \hat\beta_1 x$. Accordingly, the fitted values can be computed as $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ for $i = 1, \ldots, n$.
BWT Example: LSE
For the BWT example, we need $SS_{xy} = -50.7$, $SS_{xx} = 1215.333$, $\bar{y} = 7.7$, and $\bar{x} = 25.333$. The LSE can be computed accordingly:
$\hat\beta_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{-50.7}{1215.333} = -0.04172$
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 7.7 - (-0.04172) \times 25.333 = 8.75683$
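A quick numerical check of the LSE formulas using the summary statistics from the worksheet (a Python sketch; the names are ours):

```python
ss_xy, ss_xx = -50.7, 1215.333     # from the worksheet
xbar, ybar = 380 / 15, 115.5 / 15  # sample means
beta1_hat = ss_xy / ss_xx             # slope: ≈ -0.0417
beta0_hat = ybar - beta1_hat * xbar   # intercept: ≈ 8.7568
```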
Example: Scatterplot with LS Fitting
[Figure: scatterplot of birth weight vs. number of daily cigarettes with the fitted least squares line]
Properties of LSE
Both $\hat\beta_0$ and $\hat\beta_1$ are linear combinations of the $y_i$'s. To see this, first rewrite $\hat\beta_1$:
$\hat\beta_1 = \dfrac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x})^2} = \sum \left(\dfrac{x_i - \bar{x}}{SS_{xx}}\right) y_i = \sum w_i y_i,$
with $w_i = (x_i - \bar{x})/SS_{xx}$. Using the fact that $\bar{y} = (1/n)\sum y_i$, $\hat\beta_0$ is also a linear combination of the $y_i$'s. (Why?)
It follows (why?) that
$\hat\beta_1 \sim N\!\left[\beta_1,\; \dfrac{\sigma^2}{SS_{xx}}\right]$
$\hat\beta_0 \sim N\!\left[\beta_0,\; \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}}\right)\right]$
Estimation of σ²
Let the sum of squared errors (SSE) denote the minimized LS criterion:
$SSE = \sum \{y_i - (\hat\beta_0 + \hat\beta_1 x_i)\}^2 = SS_{yy} - \hat\beta_1 SS_{xy}$ (for hand computation)
It can be shown that $SSE/\sigma^2 \sim \chi^2(n-2)$. Details can be found in a course on linear model theory. It follows that $E(SSE/\sigma^2) = n - 2$. Hence an unbiased estimator for $\sigma^2$ is given by
$\hat\sigma^2 = \dfrac{SSE}{n-2} = \dfrac{\sum (y_i - \hat{y}_i)^2}{n-2} \equiv MSE,$
where $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ is the fitted value for $x_i$, and MSE is short for Mean Squared Error.
Example: Estimation of σ²
First find
$SSE = SS_{yy} - \hat\beta_1 SS_{xy} = 16.92 - (-0.04172) \times (-50.7) = 14.805.$
An estimate of the constant error variance $\sigma^2$ is given by
$\hat\sigma^2 = SSE/(n-2) = 14.805/(15-2) = 1.1388.$
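The shortcut formula $SSE = SS_{yy} - \hat\beta_1 SS_{xy}$ is easy to verify numerically; a sketch using the summary quantities from the slides:

```python
ss_yy, ss_xy, ss_xx = 16.92, -50.7, 1215.333
n = 15
beta1_hat = ss_xy / ss_xx          # slope from the LSE slide
sse = ss_yy - beta1_hat * ss_xy    # shortcut formula: ≈ 14.805
mse = sse / (n - 2)                # unbiased estimate of sigma^2: ≈ 1.1388
```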
Inference on β₁
Now we know that $\hat\beta_1 \sim N\left[\beta_1, \sigma^2/SS_{xx}\right]$. However, this involves the unknown parameter $\sigma^2$ besides $\beta_1$. How can we get rid of $\sigma^2$? This can be done by forming a t random variable using the following facts:
$\dfrac{\hat\beta_1 - \beta_1}{\sqrt{\sigma^2/SS_{xx}}} \sim N(0, 1)$
$\dfrac{(n-2)\hat\sigma^2}{\sigma^2} \sim \chi^2(n-2)$
The LSE $\{\hat\beta_0, \hat\beta_1\}$ is independent of SSE (let's assume this).
Therefore (why?),
$t = \dfrac{\hat\beta_1 - \beta_1}{\sqrt{\hat\sigma^2/SS_{xx}}} \sim t_{(n-2)}.$
Confidence Intervals
It follows (why?) that a $(1-\alpha) \times 100\%$ confidence interval (CI) for $\beta_1$ is
$\hat\beta_1 \pm t^{(n-2)}_{1-\alpha/2} \sqrt{\dfrac{\hat\sigma^2}{SS_{xx}}},$
where $SE(\hat\beta_1) = \sqrt{\hat\sigma^2/SS_{xx}}$ is the standard error of $\hat\beta_1$.
Following similar arguments, we can obtain a $(1-\alpha) \times 100\%$ confidence interval (CI) for $\beta_0$:
$\hat\beta_0 \pm t^{(n-2)}_{1-\alpha/2} \sqrt{\hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}}\right)},$
where $SE(\hat\beta_0) = \sqrt{\hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}}\right)}$ is the standard error of $\hat\beta_0$.
Example on CI: The BWT Example
A 95% CI for β₁ is given by
$\hat\beta_1 \pm t^{(n-2)}_{0.975}\sqrt{\dfrac{\hat\sigma^2}{SS_{xx}}} = -0.0417 \pm 2.160 \times \sqrt{\dfrac{1.1388}{1215.33}} = (-0.1078, 0.0244).$
Interpretation: with 95% confidence, we estimate that the mean infant birth weight changes by somewhere between −0.1078 and 0.0244 pounds for each additional cigarette smoked per day by a pregnant woman.
Provide a 95% CI for the change in mean infant birth weight associated with an increase of 10 cigarettes smoked per day. Hint: this asks for a 95% CI for $10\beta_1$, which can be obtained as $10 \times (-0.1078, 0.0244) = (-1.078, 0.244)$.
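A sketch of the CI computation in Python; the critical value $t^{(13)}_{0.975} = 2.160$ is taken from the slides rather than computed:

```python
import math

beta1_hat, mse, ss_xx = -0.0417, 1.1388, 1215.33
t_crit = 2.160                     # t_{0.975} with n-2 = 13 df, from the slides
se_beta1 = math.sqrt(mse / ss_xx)  # standard error of the slope
ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
# ci_beta1 ≈ (-0.1078, 0.0244); multiply both ends by 10 for the 10-cigarette CI
ci_10beta1 = (10 * ci_beta1[0], 10 * ci_beta1[1])
```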
Example on CI: The BWT Example
A 95% CI for β₀ is given by
$\hat\beta_0 \pm t^{(n-2)}_{0.975}\sqrt{\hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}}\right)} = 8.7568 \pm 2.160 \times \sqrt{1.1388 \times \left(\dfrac{1}{15} + \dfrac{25.33^2}{1215.33}\right)} = (6.9789, 10.5348).$
Interpretation: with 95% confidence, we conclude that the mean infant birth weight for a non-smoking woman ranges from 6.9789 to 10.5348 pounds.
A word of caution: since we have no data at X = 0, we are not certain whether the linear model remains appropriate when the scope of the model is extended to X = 0.
Two Standard Computer Outputs
Table of Parameter Estimates
Parameter | Estimate | SE | t (H₀: β_j = 0) | Two-sided P-value
β₀ | 8.757 | 0.8230 | 10.64 | < 0.00001
β₁ | −0.0417 | 0.0306 | −1.363 | 0.196
Analysis of Variance (ANOVA) Table
Source | df | SS | MS | F | P-value
Model | 1 | 2.115 | 2.115 | 1.857 | 0.196
Error | 13 | 14.805 | 1.139 | |
Total | 14 | 16.915 | | |
Residuals: Worksheet
ID | Cigarettes Per Day (x_i) | Birth Weight (y_i) | Fitted ŷ_i | Residual r_i
1 | 12 | 7.7 | 8.2562 | −0.5562
2 | 15 | 8.1 | 8.1311 | −0.0311
3 | 35 | 6.9 | 7.2967 | −0.3967
4 | 21 | 8.2 | 7.8808 | 0.3192
5 | 20 | 8.6 | 7.9225 | 0.6775
6 | 17 | 8.3 | 8.0476 | 0.2524
7 | 19 | 9.4 | 7.9642 | 1.4358
8 | 46 | 7.8 | 6.8378 | 0.9622
9 | 20 | 8.3 | 7.9225 | 0.3775
10 | 25 | 5.2 | 7.7139 | −2.5139
11 | 39 | 6.4 | 7.1299 | −0.7299
12 | 25 | 7.9 | 7.7139 | 0.1861
13 | 30 | 8.0 | 7.5053 | 0.4947
14 | 27 | 6.1 | 7.6305 | −1.5305
15 | 29 | 8.6 | 7.5470 | 1.0530
sum of residuals = 0
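The fitted values and residuals in the worksheet can be regenerated from the LSE; a Python sketch (note that the residuals sum to zero, up to rounding):

```python
x = [12, 15, 35, 21, 20, 17, 19, 46, 20, 25, 39, 25, 30, 27, 29]
y = [7.7, 8.1, 6.9, 8.2, 8.6, 8.3, 9.4, 7.8, 8.3, 5.2, 6.4, 7.9, 8.0, 6.1, 8.6]
beta0_hat, beta1_hat = 8.75683, -0.04172   # LSE from the BWT example
fitted = [beta0_hat + beta1_hat * xi for xi in x]          # yhat_i
resid = [yi - fi for yi, fi in zip(y, fitted)]             # r_i = y_i - yhat_i
# A defining property of LSE residuals (with an intercept): sum(resid) ≈ 0
```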
Residual Plots
[Figure: histogram of residuals and normal Q-Q plot]
Residual Plots
[Figure: (a) residuals vs. fitted values; (b) residuals vs. x]
Naive Confidence/Prediction Bands
[Figure: LS fitted line with naive confidence bands and prediction bands for birth weight vs. number of cigarettes]
Note: the critical value used here is $t^{(n-2)}_{1-\alpha/2}$. This approach suffers from multiplicity.
Working-Hotelling Confidence Bands
[Figure: LS fitted line with naive confidence bands and Working-Hotelling confidence bands]
Note: the critical value used in the Working-Hotelling confidence band is $W = \sqrt{2 F^{(2, n-2)}_{1-\alpha}}$.
Model Estimation Statistical Inference Discussion Thanks! Questions?