TMA4255 Applied Statistics V2016 (5)
Part 2: Regression
Simple linear regression [11.1-11.4]
Sum of squares [11.5]
Anna Marie Holand
To be lectured: January 26, 2016
wiki.math.ntnu.no/tma4255/2016v/start
2 Part 2: Regression analysis
Y: response, dependent variable.
x: independent variable, regressor, covariate, predictor, explanatory variable.
Goal: describe Y as a function of one or more x's. A statistical description based on a law of nature, some local approximation, the correlation between variables, trends over time, etc.
Linear regression: Y is a linear function of one or more (possibly transformed) x's.
Simple linear regression: only one x.
Multiple linear regression: several x's.
3 Wood quality
Wood density is a measure of wood quality. Within the wood industry there is a need to develop techniques that reduce the duration and cost of wood analyses. Wood stiffness is generally evaluated by determining the modulus of elasticity in static bending, and lately sonic measurements have been investigated; this is expensive. We will look at a data set of simultaneous measurements of wood stiffness and wood density, to see if density can be used as a substitute for stiffness.
Comment: source unknown. This data set has been used in this course for several years; the data file was taken from John Tyssedal.
4 Wood quality x =wood density, and Y = log wood stiffness.
5 Simple linear regression [11.1-11.4]
Previously, we modelled a single random variable as Y = µ + ε, where ε was normally distributed with E(ε) = 0 and Var(ε) = σ², and we wrote Y ~ N(µ, σ²).
The simple linear regression model:
Y_i = β_0 + β_1 x_i + ε_i,
where the ε_i are normally distributed with E(ε_i) = 0 and Var(ε_i) = σ², for i = 1, ..., n.
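The model is easy to simulate; a minimal Python sketch, with parameter values borrowed from the fitted wood model purely for illustration (the x-grid is made up):

```python
import random

random.seed(1)

# Illustrative parameter values (roughly those of the wood fit); not real data
beta0, beta1, sigma = 8.25, 0.125, 0.24

x = [10 + 0.5 * i for i in range(30)]  # fixed covariate values (made up)
# Each response is the regression line plus independent N(0, sigma^2) noise
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

print(len(x), len(y))  # 30 simulated (x_i, y_i) pairs
```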
6 Useful identities
Σ(x_i − x̄) = Σ x_i − n x̄ = n x̄ − n x̄ = 0
S_xy = Σ(x_i − x̄)(y_i − ȳ) = Σ(x_i − x̄)y_i − ȳ Σ(x_i − x̄) = Σ(x_i − x̄)y_i
Equivalently:
S_xy = Σ(x_i − x̄)y_i = Σ x_i y_i − x̄ Σ y_i = Σ x_i y_i − n x̄ ȳ
7 Useful identities
S_xx = Σ(x_i − x̄)² = Σ(x_i − x̄)(x_i − x̄) = Σ x_i (x_i − x̄) − x̄ Σ(x_i − x̄) = Σ x_i² − x̄ Σ x_i = Σ x_i² − n x̄²
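The identities above can be checked numerically; a small Python sketch on made-up numbers:

```python
# Made-up data set, only to check the identities numerically
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 5.0, 9.0]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# The deviations from the mean sum to zero
assert abs(sum(xi - xbar for xi in x)) < 1e-12

# S_xy: definition vs. the computational shortcut
Sxy_def = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Sxy_short = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
assert abs(Sxy_def - Sxy_short) < 1e-12

# S_xx: definition vs. the computational shortcut
Sxx_def = sum((xi - xbar) ** 2 for xi in x)
Sxx_short = sum(xi * xi for xi in x) - n * xbar ** 2
assert abs(Sxx_def - Sxx_short) < 1e-12
print("identities hold")
```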
8 Least squares estimators for β_0 and β_1
Given a data set {(x_i, y_i); i = 1, ..., n}, the least squares estimators B_0 and B_1 for the parameters β_0 and β_1 are given as:
B_1 = Σ(x_i − x̄)(Y_i − Ȳ) / Σ(x_i − x̄)² = Σ(x_i − x̄)Y_i / Σ(x_i − x̄)²
B_0 = Ȳ − B_1 x̄ = (Σ Y_i − B_1 Σ x_i) / n
These are also the maximum likelihood estimators, and the estimates are called b_0 and b_1.
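The formulas for b_0 and b_1 translate directly into code; a minimal Python sketch (toy data, not the wood data):

```python
def least_squares(x, y):
    """Least squares estimates (b0, b1) for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = Sxy / Sxx           # slope: S_xy / S_xx
    b0 = ybar - b1 * xbar    # intercept: ybar - b1 * xbar
    return b0, b1

# Toy data lying exactly on the line y = 1 + 2x
b0, b1 = least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)  # 1.0 2.0
```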
9 Properties of the estimators B_0 and B_1
Estimators: B_0 = Ȳ − B_1 x̄ and B_1 = Σ(x_i − x̄)Y_i / Σ(x_i − x̄)².
Distribution: both are normally distributed.
Mean: E(B_0) = β_0 and E(B_1) = β_1, i.e. both are unbiased.
Variance: Var(B_0) = σ² Σ x_i² / (n Σ(x_i − x̄)²) and Var(B_1) = σ² / Σ(x_i − x̄)².
See exercise 3, problem 1.
10 MINITAB
The regression equation is log(stiff) = 8,25 + 0,125 density

Predictor   Coef      SE Coef   T      P
Constant    8,2516    0,1281    64,39  0,000
density     0,125190  0,007767  16,12  0,000

S = 0,243964   R-Sq = 90,3%   R-Sq(adj) = 89,9%
11 Interpretation?
We have wood density as covariate (x) and log wood stiffness as response (y), and have fitted a simple linear regression. What do b_0 = 8.25 and b_1 = 0.125 mean?
A: If the wood density increases by 1, the log wood stiffness increases by 8.25.
B: If the wood density increases by 1, the log wood stiffness increases by 0.125.
C: I don't know.
Vote at clicker.math.ntnu.no, class room TMA4255.
12 Correlation?
DEF 4.5: Let X and Y be two random variables with covariance σ_XY and variances σ_X² and σ_Y², respectively. The correlation coefficient of X and Y is
ρ_XY = Cov(X, Y) / √(Var(X) Var(Y)) = σ_XY / (σ_X σ_Y)
TEO 4.4: The covariance of two random variables X and Y with means µ_X = E(X) and µ_Y = E(Y) is given by
σ_XY = Cov(X, Y) = E(XY) − E(X)E(Y) = E(XY) − µ_X µ_Y
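TEO 4.4 can be checked by simulation; a Python sketch using a made-up construction where Y = X + noise, so that Cov(X, Y) = Var(X) = 1:

```python
import random

random.seed(0)

# Simulate pairs (X, Y) with X ~ N(0,1) and Y = X + N(0,1) noise
xs, ys = [], []
for _ in range(100000):
    xv = random.gauss(0, 1)
    xs.append(xv)
    ys.append(xv + random.gauss(0, 1))

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Sample version of Cov(X, Y) = E(XY) - E(X)E(Y); should be close to 1
cov = sum(xv * yv for xv, yv in zip(xs, ys)) / n - mx * my
print(cov)
```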
13 b_1 and r
An estimate of the correlation between two random variables is the Pearson correlation coefficient:
r = S_xy / √(S_xx S_yy)
The simple linear regression slope b_1 is given as b_1 = S_xy / S_xx.
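Combining the two formulas gives b_1 = r √(S_yy / S_xx): the slope is the correlation rescaled by the spread of y relative to x. A quick numerical check on made-up data:

```python
import math

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)   # Pearson correlation coefficient
b1 = Sxy / Sxx                   # least squares slope

# Slope = correlation rescaled by the spread of y relative to x
assert abs(b1 - r * math.sqrt(Syy / Sxx)) < 1e-12
print(round(r, 3), round(b1, 3))
```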
14 Estimator for σ²
An unbiased estimator for σ² is
S² = SSE / (n − 2) = Σ(Y_i − Ŷ_i)² / (n − 2) = Σ(Y_i − B_0 − B_1 x_i)² / (n − 2)
This is not the maximum likelihood estimator. Further,
V = (n − 2)S² / σ²
is chi-squared distributed with n − 2 degrees of freedom.
15 MINITAB
The regression equation is log(stiff) = 8,25 + 0,125 density

Predictor   Coef      SE Coef   T      P
Constant    8,2516    0,1281    64,39  0,000
density     0,125190  0,007767  16,12  0,000

S = 0,243964   R-Sq = 90,3%   R-Sq(adj) = 89,9%

The SE Coef column is obtained by inserting S for σ in
Var(B_0) = σ² Σ x_i² / (n Σ(x_i − x̄)²) and Var(B_1) = σ² / Σ(x_i − x̄)²
and taking the square root.
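The SE Coef and T columns can be reproduced by hand; a Python sketch on made-up data (not the wood data):

```python
import math

# Made-up data for illustration; not the wood data from the slides
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# Unbiased estimate of sigma^2: SSE / (n - 2)
SSE = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s2 = SSE / (n - 2)
s = math.sqrt(s2)  # this is the "S" in the MINITAB output

# Standard errors: insert s for sigma in Var(B0), Var(B1), take square roots
se_b1 = s / math.sqrt(Sxx)
se_b0 = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * Sxx))

# t-statistics, as in the T column: coefficient / standard error
print(b1 / se_b1, b0 / se_b0)
```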
16 Sum of squares [11.5, 11.8]
SST = Σ(y_i − ȳ)², total sum of squares.
SSE = Σ(y_i − ŷ_i)², error sum of squares.
SSR = Σ(ŷ_i − ȳ)², regression sum of squares.
Coefficient of determination:
R² = 1 − SSE/SST
R²_adj = 1 − (SSE/(n − 2)) / (SST/(n − 1))
Two random variables X and Y with linear correlation coefficient ρ have R² = ρ².
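For a least squares fit the sums of squares satisfy SST = SSR + SSE; a Python sketch on made-up data checking the decomposition and R²:

```python
# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least squares fit and fitted values
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)                # total
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # error
SSR = sum((yh - ybar) ** 2 for yh in yhat)             # regression

assert abs(SST - (SSR + SSE)) < 1e-9   # decomposition holds
R2 = 1 - SSE / SST
R2_adj = 1 - (SSE / (n - 2)) / (SST / (n - 1))
print(round(R2, 4), round(R2_adj, 4))
```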
17 MINITAB
The regression equation is log10(stiff) = 8,25 + 0,125 density

S = 0,243964   R-Sq = 90,3%   R-Sq(adj) = 89,9%

Analysis of Variance
Source          DF  SS      MS      F       P
Regression       1  15,464  15,464  259,81  0,000
Residual Error  28   1,667   0,060
Total           29  17,130
18 Coefficient of determination
R² = 1 − SSE/SST is the relative amount of the total variation that is explained by the simple linear regression model.