Statistics for EES Linear regression and linear models


1 Statistics for EES Linear regression and linear models Dirk Metzler July 2010

2 Contents
1 Univariate linear regression: how and why?
2 t-test for linear regression
3 log-scaling the data
4 Checking model assumptions
5 Linear regression example with scaling
6 Why it's called regression

4 Univariate linear regression: how and why? Griffon vulture Gyps fulvus (German: Gänsegeier); photo (c) by Jörg Hempel

8 Univariate linear regression: how and why?
Prinzinger, R., E. Karl, R. Bögel, Ch. Walzer (1999): Energy metabolism, body temperature, and cardiac work in the Griffon vulture Gyps fulvus - telemetric investigations in the laboratory and in the field. Zoology 102, Suppl. II: 15
Data from Goethe-University, Group of Prof. Prinzinger, who developed a telemetric system for measuring heart beats of flying birds.
Important for ecological questions: metabolic rate.
Metabolic rate can only be measured in the lab.
Can we infer metabolic rate from heart beat frequency?

9 Univariate linear regression: how and why?
[scatter plot: griffon vulture at 16 degrees C; metabolic rate [J/(g*h)] against heart beats [per minute]]

11 Univariate linear regression: how and why?
[vulture data frame: columns day, heartbpm, metabol, mintemp, maxtemp, medtemp; numeric values not reproduced; measurements from 14 different days]

12 Univariate linear regression: how and why?
> model <- lm(metabol~heartbpm,data=vulture, subset=day=="17.05.")
> summary(model)
Call: lm(formula = metabol ~ heartbpm, data = vulture, subset = day == "17.05.")
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-08 ***
heartbpm e-14 ***
---
Signif. codes: 0 *** ** 0.01 *
Residual standard error: on 17 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 17 DF, p-value: 2.979e-14

19 Univariate linear regression: how and why?
[sketch: three points (x_1, y_1), (x_2, y_2), (x_3, y_3) with a line through them]
intercept a, slope b = (y_2 - y_1)/(x_2 - x_1), line y = a + b·x

24 Univariate linear regression: how and why?
[scatter plot with fitted line; residuals r_1, r_2, ..., r_n drawn as vertical distances from the line]
residuals r_i = y_i - (a + b·x_i)
the line must minimize the sum of squared residuals r_1^2 + r_2^2 + ... + r_n^2

25 Univariate linear regression: how and why?
define the regression line y = â + b̂·x by minimizing the sum of squared residuals:
(â, b̂) = arg min_(a,b) Σ_i (y_i - (a + b·x_i))^2
this is based on the model assumption that values a, b exist such that, for all data points (x_i, y_i), we have
y_i = a + b·x_i + ε_i,
where all ε_i are independent and normally distributed with the same variance σ^2.

30 Univariate linear regression: how and why?
given data:
Y: y_1, y_2, y_3, ..., y_n
X: x_1, x_2, x_3, ..., x_n
Model: there are values a, b, σ^2 such that
y_1 = a + b·x_1 + ε_1
y_2 = a + b·x_2 + ε_2
y_3 = a + b·x_3 + ε_3
...
y_n = a + b·x_n + ε_n
ε_1, ε_2, ..., ε_n are independent N(0, σ^2).
Hence y_1, y_2, ..., y_n are independent with y_i ~ N(a + b·x_i, σ^2).
a, b, σ^2 are unknown, but not random.

33 Univariate linear regression: how and why?
We estimate a and b by computing
(â, b̂) := arg min_(a,b) Σ_i (y_i - (a + b·x_i))^2.
Theorem: â and b̂ are given by
b̂ = Σ_i (y_i - ȳ)(x_i - x̄) / Σ_i (x_i - x̄)^2 = Σ_i y_i (x_i - x̄) / Σ_i (x_i - x̄)^2
and â = ȳ - b̂·x̄.
Please keep in mind: The line y = â + b̂·x goes through the center of gravity of the cloud of points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
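The estimator formulas in the theorem are easy to check numerically. A minimal sketch in plain Python (illustrative toy data, not the vulture measurements; `fit_line` is a hypothetical helper name, the slides themselves use R's lm()):

```python
# Least-squares estimates following the theorem:
#   b_hat = sum_i (y_i - ybar)(x_i - xbar) / sum_i (x_i - xbar)^2
#   a_hat = ybar - b_hat * xbar
def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]        # roughly y = 1 + 2x plus noise
a_hat, b_hat = fit_line(x, y)   # a_hat = 1.15, b_hat = 1.94
# the fitted line passes through the center of gravity (xbar, ybar) = (2.5, 6.0)
```

The last comment is exactly the "center of gravity" property stated in the theorem: â + b̂·x̄ equals ȳ by construction of â.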

34 Univariate linear regression: how and why?
Sketch of the proof of the theorem
Let g(a, b) = Σ_i (y_i - (a + b·x_i))^2. We optimize g by setting the derivatives of g to 0:
∂g(a, b)/∂a = Σ_i 2·(y_i - (a + b·x_i))·(-1)
∂g(a, b)/∂b = Σ_i 2·(y_i - (a + b·x_i))·(-x_i)
Setting both to 0, we obtain
0 = Σ_i (y_i - (â + b̂·x_i))·(-1)
0 = Σ_i (y_i - (â + b̂·x_i))·(-x_i)

35 Univariate linear regression: how and why?
0 = Σ_i (y_i - (â + b̂·x_i))
0 = Σ_i (y_i - (â + b̂·x_i))·x_i
gives us
0 = (Σ_i y_i) - n·â - b̂·(Σ_i x_i)
0 = (Σ_i y_i·x_i) - â·(Σ_i x_i) - b̂·(Σ_i x_i^2)
and the theorem follows by solving this for â and b̂.

36 Univariate linear regression: how and why?
[vulture data frame again: columns day, heartbpm, metabol, mintemp, maxtemp, medtemp; numeric values not reproduced; measurements from 14 different days]

37 Univariate linear regression: how and why?
> model <- lm(metabol~heartbpm,data=vulture, subset=day=="17.05.")
> summary(model)
Call: lm(formula = metabol ~ heartbpm, data = vulture, subset = day == "17.05.")
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-08 ***
heartbpm e-14 ***
---
Signif. codes: 0 *** ** 0.01 *
Residual standard error: on 17 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 17 DF, p-value: 2.979e-14

39 Univariate linear regression: how and why?
Optimizing the clutch size
Example: Cowpea weevil (also bruchid beetle), Callosobruchus maculatus, German: Erbsensamenkäfer
Wilson, K. (1994) Evolution of clutch size in insects. II. A test of static optimality models using the beetle Callosobruchus maculatus (Coleoptera: Bruchidae). Journal of Evolutionary Biology 7:
How does survival probability depend on clutch size?
Which clutch size optimizes the expected number of surviving offspring?

43 Univariate linear regression: how and why?
[plots: viability against clutchsize, and clutchsize * viability against clutchsize, with fitted curves]


46 t-test for linear regression
Example: red deer (Cervus elaphus)
theory: females can influence the sex of their offspring
Evolutionarily stable strategy: weak animals may tend to have female offspring, strong animals may tend to have male offspring.
Clutton-Brock, T. H., Albon, S. D., Guinness, F. E. (1986) Great expectations: dominance, breeding success and offspring sex ratios in red deer. Anim. Behav. 34,

47 t-test for linear regression
[hind data frame: columns rank and ratiomales; values not reproduced]
CAUTION: Simulated data, inspired by original paper

49 t-test for linear regression
[scatter plot: hind$ratiomales against hind$rank, with fitted regression line]

50 t-test for linear regression
> mod <- lm(ratiomales~rank,data=hind)
> summary(mod)
Call: lm(formula = ratiomales ~ rank, data = hind)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-06 ***
rank e-09 ***
---
Signif. codes: 0 *** ** 0.01 *
Residual standard error: on 52 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

55 t-test for linear regression
Model: Y = a + b·X + ε with ε ~ N(0, σ^2)
How to compute the significance of a relationship between the explanatory trait X and the target variable Y?
In other words: How can we test the null hypothesis b = 0?
We have estimated b by b̂ ≠ 0. Could the true b be 0?
How large is the standard error of b̂?

60 t-test for linear regression
y_i = a + b·x_i + ε_i with ε_i ~ N(0, σ^2)
not random: a, b, x_i, σ^2; random: ε_i, y_i
var(y_i) = var(a + b·x_i + ε_i) = var(ε_i) = σ^2, and y_1, y_2, ..., y_n are stochastically independent.
b̂ = Σ_i y_i(x_i - x̄) / Σ_i (x_i - x̄)^2
var(b̂) = var( Σ_i y_i(x_i - x̄) ) / ( Σ_i (x_i - x̄)^2 )^2
= Σ_i var(y_i)·(x_i - x̄)^2 / ( Σ_i (x_i - x̄)^2 )^2
= σ^2 · Σ_i (x_i - x̄)^2 / ( Σ_i (x_i - x̄)^2 )^2
= σ^2 / Σ_i (x_i - x̄)^2

64 t-test for linear regression
In fact b̂ is normally distributed with mean b and var(b̂) = σ^2 / Σ_i (x_i - x̄)^2.
Problem: We do not know σ^2.
We estimate σ^2 by considering the residual variance:
s^2 := Σ_i (y_i - â - b̂·x_i)^2 / (n - 2)
Note that we divide by n - 2. The reason for this is that two model parameters a and b have been estimated, which means that two degrees of freedom got lost.

65 t-test for linear regression
var(b̂) = σ^2 / Σ_i (x_i - x̄)^2
Estimate σ^2 by
s^2 = Σ_i (y_i - â - b̂·x_i)^2 / (n - 2).
Then
(b̂ - b) / ( s / sqrt(Σ_i (x_i - x̄)^2) )
is Student-t-distributed with n - 2 degrees of freedom, and we can apply the t-test to test the null hypothesis b = 0.
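The standard error and the t statistic can be traced step by step; a Python sketch with toy data (not the red deer values; in practice R's summary(lm(...)) reports these numbers, and the p-value would come from the t distribution with n - 2 degrees of freedom):

```python
import math

# Toy data; estimate a, b, the residual variance s^2 (dividing by n - 2),
# the standard error of b_hat, and the t statistic for H0: b = 0.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b_hat = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / sxx
a_hat = ybar - b_hat * xbar

# residual variance with n - 2 degrees of freedom
s2 = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

se_b = math.sqrt(s2 / sxx)   # estimated standard deviation of b_hat
t = b_hat / se_b             # ~21.4 here: b = 0 would be clearly rejected
```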


67 log-scaling the data
Data example: typical body weight [kg] and brain weight [g] of 62 mammal species (and 3 dinosaurs)
[data frame with columns weight.kg., brain.weight.g, species, extinct; rows include african elephant, asian elephant, cat, chimpanzee, ...; values not reproduced]

70 log-scaling the data
[plots of typical values for the mammal species ("typische Werte bei 62 Saeugetierarten"): brain weight ("Gehirngewicht") [g] against body weight ("Koerpergewicht") [kg], on the original scale and on log-log axes; the log-log plot labels species from mouse to african elephant plus the dinosaurs Triceratops, Diplodocus and Brachiosaurus]

71 log-scaling the data
> modell <- lm(brain.weight.g~weight.kg.,subset=extinct=="no")
> summary(modell)
Call: lm(formula = brain.weight.g ~ weight.kg., subset = extinct == "no")
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) *
weight.kg. <2e-16 ***
---
Signif. codes: 0 *** ** 0.01 *
Residual standard error: on 60 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 60 DF, p-value: < 2.2e-16

76 log-scaling the data
Diagnostic plots for modell:
qqnorm(modell$residuals)
plot(modell$fitted.values,modell$residuals)
plot(modell$fitted.values,modell$residuals,log="x")
plot(modell$model$weight.kg.,modell$residuals)
plot(modell$model$weight.kg.,modell$residuals,log="x")
[normal Q-Q plot of the residuals; residuals against fitted values and against body weight, each also with log-scaled x axis]

79 log-scaling the data
We see that the residuals' variance depends on the fitted values (or on the body weight): heteroscedasticity.
The model assumes homoscedasticity, i.e. the random deviations must be (almost) independent of the explaining traits (body weight) and the fitted values.
Variance-stabilizing transformation: can we rescale body and brain size to make the deviations independent of the variables?

80 log-scaling the data
Actually not so surprising: An elephant's brain of typically 5 kg can easily be 500 g lighter or heavier from individual to individual. This cannot happen for a mouse brain of typically 5 g. The latter will rather also vary by 10%, i.e. 0.5 g. Thus, the variance is not additive but rather multiplicative:
brain mass = (expected brain mass) · random
We can convert this into something with additive randomness by taking the log:
log(brain mass) = log(expected brain mass) + log(random)
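The multiplicative-error argument can be illustrated by simulation. A Python sketch with made-up numbers (5 g "mouse" brains, 5000 g "elephant" brains; sigma = 0.1 corresponds to the roughly 10% individual variation mentioned above, not to the real mammal data):

```python
import math
import random

random.seed(1)

def sd(values):
    # sample standard deviation
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# brain mass = (expected brain mass) * random, with log(random) ~ N(0, 0.1^2)
def simulate(expected, sigma=0.1, n=10000):
    return [expected * math.exp(random.gauss(0.0, sigma)) for _ in range(n)]

mouse = simulate(5.0)         # brains around 5 g
elephant = simulate(5000.0)   # brains around 5000 g

sd_ratio = sd(elephant) / sd(mouse)    # ~1000: on the raw scale SD grows with the mean
log_sd_mouse = sd([math.log(v) for v in mouse])        # ~0.1
log_sd_elephant = sd([math.log(v) for v in elephant])  # ~0.1
# on the log scale the randomness is additive with the same sigma everywhere
```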

81 log-scaling the data
> logmodell <- lm(log(brain.weight.g)~log(weight.kg.),subset=extinct=="no")
> summary(logmodell)
Call: lm(formula = log(brain.weight.g) ~ log(weight.kg.), subset = extinct == "no")
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) <2e-16 ***
log(weight.kg.) <2e-16 ***
---
Signif. codes: 0 *** ** 0.01 *
Residual standard error: on 60 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

86 log-scaling the data
Diagnostic plots for logmodell:
qqnorm(logmodell$residuals)
plot(logmodell$fitted.values,logmodell$residuals)
plot(logmodell$fitted.values,logmodell$residuals,log="x")
plot(weight.kg.[extinct=="no"],logmodell$residuals)
plot(weight.kg.[extinct=="no"],logmodell$residuals,log="x")
[normal Q-Q plot of the residuals; residuals against fitted values and against body weight, each also with log-scaled x axis]


89 Checking model assumptions
Is the model appropriate for the data? e.g.
Y = a + b·X + ε with ε ~ N(0, σ^2)
If the model fits, the residuals
y_i - (â + b̂·x_i)
must look normally distributed and must not have obvious dependencies with X or with â + b̂·X.

90 Checking model assumptions
Example: is the relation between X and Y sufficiently well described by the linear equation Y_i = a + b·X_i + ε_i?
[scatter plot of Y against X]

91 Checking model assumptions
> mod <- lm(Y ~ X)
> summary(mod)
Call: lm(formula = Y ~ X)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept)
X <2e-16 ***
---
Signif. codes: 0 *** ** 0.01 *

94 Checking model assumptions
> plot(X,residuals(mod))
[plot of the residuals against X]
Obviously, the residuals tend to be larger for very large and very small values of X than for mean values of X. That should not be!

98 Checking model assumptions
Idea: Fit a section of a parabola instead of a line to (x_i, y_i), i.e. a model of the form
Y_i = a + b·X_i + c·X_i^2 + ε_i.
Is this still a linear model?
Yes: Let Z = X^2, then Y is linear in X and Z.
In R:
> Z <- X^2
> mod2 <- lm(Y ~ X+Z)
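Behind lm(Y ~ X + Z) is still ordinary least squares: with the extra column Z = X^2 the model is linear in the coefficients (a, b, c). A Python sketch that solves the normal equations (D^T D)·beta = D^T·y for a design matrix D with columns [1, x, x^2], using toy data lying exactly on y = 1 + 2x + 3x^2 (not the data from the slide; `solve3` is a hypothetical helper):

```python
def solve3(A, v):
    # Gaussian elimination with partial pivoting for a small linear system
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (M[r][n] - sum(M[r][k] * beta[k]
                                 for k in range(r + 1, n))) / M[r][r]
    return beta

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]   # exactly on the parabola
D = [[1.0, x, x * x] for x in xs]                # columns: 1, X, Z = X^2
DtD = [[sum(row[i] * row[j] for row in D) for j in range(3)] for i in range(3)]
Dty = [sum(row[i] * yi for row, yi in zip(D, ys)) for i in range(3)]
a_hat, b_hat, c_hat = solve3(DtD, Dty)           # recovers 1, 2, 3
```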

99 Checking model assumptions
> summary(mod2)
Call: lm(formula = Y ~ X + Z)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) <2e-16 ***
X *
Z <2e-16 ***
---
Signif. codes: 0 *** ** 0.01 *

100 Checking model assumptions
For this model there is no obvious dependence between X and the residuals:
plot(X,residuals(mod2))
[plot of the residuals of mod2 against X, now without visible structure]

105 Checking model assumptions
Is the assumption of normality in the model Y_i = a + b·X_i + ε_i in accordance with the data?
Are the residuals r_i = Y_i - (â + b̂·X_i) more or less normally distributed?
Graphical method: compare the theoretical quantiles of the standard normal distribution N(0, 1) with those of the residuals.
Background: If we plot the quantiles of N(µ, σ^2) against those of N(0, 1), we obtain a line y(x) = µ + σ·x.
(Reason: If X is standard-normally distributed and Y = a + b·X, then Y is normally distributed with mean a and variance b^2.)
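The background statement can be checked directly with Python's standard library (arbitrary example values µ = 1, σ = 0.5, the same as in the qnorm example on a later slide):

```python
from statistics import NormalDist

# Quantiles of N(mu, sigma^2) plotted against quantiles of N(0, 1)
# lie on the line y(x) = mu + sigma * x.
mu, sigma = 1.0, 0.5
standard = NormalDist(0.0, 1.0)
scaled = NormalDist(mu, sigma)

pairs = [(standard.inv_cdf(p), scaled.inv_cdf(p))
         for p in (0.05, 0.25, 0.5, 0.75, 0.95)]
max_dev = max(abs(y - (mu + sigma * x)) for x, y in pairs)   # ~0
```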

110 Checking model assumptions
Before we fit the model with lm() we first have to check whether the model assumptions are fulfilled.
To check the assumptions underlying a linear model we need the residuals.
To compute the residuals we first have to fit the model (in R with lm()).
After that we can check the model assumptions and decide whether we stay with this model or still have to modify it.

111 Checking model assumptions
p <- seq(from=0,to=1,by=0.01)
plot(qnorm(p,mean=0,sd=1),qnorm(p,mean=1,sd=0.5), pch=16,cex=0.5)
abline(v=0,h=0)
[plot: qnorm(p, mean = 1, sd = 0.5) against qnorm(p, mean = 0, sd = 1), a line with intercept 1 and slope 0.5]

114 Checking model assumptions
If we plot the empirical quantiles of a sample from a normal distribution against the theoretical quantiles of a standard normal distribution, the values are not precisely on the line but are scattered around a line.
If no systematic deviations from an imaginary line are recognizable: the normal distribution assumption is acceptable.
If systematic deviations from an imaginary line are obvious: the assumption of normality may be problematic. It may be necessary to rescale variables or to take additional explanatory variables into account.


116 Linear regression example with scaling
Data: For 301 US counties, the number of white female inhabitants in 1960 and the number of deaths by breast cancer in this group between 1950 and (Rice (2007) Mathematical Statistics and Data Analysis.)
[canc data frame: columns deaths and inhabitants; values not reproduced]

117 Linear regression example with scaling
Is the average number of deaths proportional to population size, i.e.
E[deaths] = b · inhabitants,
or does the cancer risk depend on the size of the county, such that a different model fits better? e.g.
E[deaths] = a + b · inhabitants
with a ≠ 0.

118 Linear regression example with scaling
> modell <- lm(deaths~inhabitants,data=canc)
> summary(modell)
Call: lm(formula = deaths ~ inhabitants, data = canc)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e
inhabitants 3.578e e <2e-16 ***
---
Signif. codes: 0 *** ** 0.01 *
Residual standard error: 13 on 299 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: 4315 on 1 and 299 DF, p-value: < 2.2e-16

121 Linear regression example with scaling
The intercept estimate is not significantly different from 0.
Thus we cannot reject the null hypothesis that the county size has no influence on the cancer risk.
But... does the model fit?

125 Linear regression example with scaling
Diagnostic plots for modell:
qqnorm(modell$residuals)
plot(modell$fitted.values,modell$residuals)
plot(modell$fitted.values,modell$residuals,log="x")
plot(canc$inhabitants,modell$residuals,log="x")
[normal Q-Q plot of the residuals; residuals against fitted values and against canc$inhabitants, with log-scaled x axis]

128 Linear regression example with scaling
The variance of the residuals depends on the fitted values: heteroscedasticity.
The linear model assumes homoscedasticity.
Variance-stabilizing transformation: How can we rescale the population size such that we obtain homoscedastic data?

131 Linear regression example with scaling
Where does the variance come from?
If n is the number of white female inhabitants and p the individual probability to die by breast cancer within 10 years, then n·p is the expected number of deaths and the variance is n·p·(1-p) ≈ n·p (maybe approximate the binomial by a Poisson distribution). Standard deviation: sqrt(n·p).
In this case we can approximately stabilize the variance by taking the square root on both sides of the equation.

132 Linear regression example with scaling
Explanation: if
sqrt(y) = b·sqrt(x) + ε,
then
y = (b·sqrt(x) + ε)^2 = b^2·x + 2·b·sqrt(x)·ε + ε^2.
The SD is not exactly proportional to sqrt(x), but at least the term 2·b·sqrt(x)·ε has SD proportional to sqrt(x), namely 2·b·σ·sqrt(x). The term ε^2 is the σ^2-fold of a χ²_1-distributed random variable and has SD = σ^2·sqrt(2). If σ is small compared to b·sqrt(x), the approximation
y ≈ b^2·x + 2·b·sqrt(x)·ε
is reasonable and the SD of y is approximately proportional to sqrt(x).
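A quick numerical illustration of why the square root stabilizes the variance of counts. This Python sketch approximates Poisson-like counts by N(µ, µ), i.e. a distribution whose variance equals its mean (a simulation under that assumption, not the county data):

```python
import math
import random

random.seed(7)

def sd(values):
    # sample standard deviation
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# Counts with variance ~ mean (Poisson-like), approximated by N(mu, mu);
# after sqrt() the standard deviation is close to 1/2 for every mean,
# since sd(sqrt(Y)) ~ sqrt(mu) / (2 * sqrt(mu)) = 1/2 by the delta method.
def sd_of_sqrt_counts(mu, n=20000):
    counts = [max(0.0, random.gauss(mu, math.sqrt(mu))) for _ in range(n)]
    return sd([math.sqrt(c) for c in counts])

sds = [sd_of_sqrt_counts(mu) for mu in (100.0, 1000.0, 10000.0)]
# all three values are ~0.5 although the means span a factor of 100
```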

133 Linear regression example with scaling
> modellsq <- lm(sqrt(deaths)~sqrt(inhabitants),data=canc)
> summary(modellsq)
Call: lm(formula = sqrt(deaths) ~ sqrt(inhabitants), data = canc)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept)
sqrt(inhabitants) <2e-16 ***
---
Signif. codes: 0 *** ** 0.01 *
Residual standard error: on 299 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: 4051 on 1 and 299 DF, p-value: < 2.2e-16

135 Linear regression example with scaling
Diagnostic plots for modellsq:
qqnorm(modellsq$residuals)
plot(modellsq$fitted.values,modellsq$residuals,log="x")
plot(canc$inhabitants,modellsq$residuals,log="x")
[normal Q-Q plot of the residuals; residuals against fitted values and against canc$inhabitants, with log-scaled x axis]

137 Linear regression example with scaling
The qqnorm plot is not perfect, but at least the variance is stabilized.
The result remains the same: No significant relation between county size and breast cancer death risk.


139 Why it s called regression Origin of the word Regression Sir Francis Galton ( ): Regression toward the mean.

140 Why it's called regression Origin of the word Regression Sir Francis Galton (1822–1911): Regression toward the mean. Tall fathers tend to have sons that are slightly shorter than themselves. Sons of short fathers are on average taller than their fathers.
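Regression toward the mean is easy to reproduce in a simulation. A minimal sketch in Python (the mean of 175 cm, SD of 7 cm and correlation 0.5 are invented values, not Galton's data): among tall fathers, the sons are on average shorter than their fathers, yet still taller than the population mean.

```python
import random
import statistics

random.seed(42)

MEAN, SD, R = 175.0, 7.0, 0.5   # invented population values
n = 50_000

fathers = [random.gauss(MEAN, SD) for _ in range(n)]
# Son regresses toward the mean: correlation R with the father's height,
# same marginal standard deviation as the fathers.
noise_sd = SD * (1.0 - R ** 2) ** 0.5
sons = [MEAN + R * (f - MEAN) + random.gauss(0.0, noise_sd) for f in fathers]

tall = [(f, s) for f, s in zip(fathers, sons) if f > MEAN + 10]
avg_tall_fathers = statistics.mean(f for f, _ in tall)
avg_their_sons = statistics.mean(s for _, s in tall)

# Sons of tall fathers: on average shorter than their fathers,
# but still taller than the overall mean.
print(avg_tall_fathers, avg_their_sons)
```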

141–148 Why it's called regression

[Scatterplots 'Koerpergroessen' (body heights): son's height (Sohn) plotted against father's height (Vater)]

149 Why it's called regression Similar effects In sports: the champion of the season will tend to fall short of the high expectations in the next year.

150 Why it's called regression Similar effects In sports: the champion of the season will tend to fall short of the high expectations in the next year. In school: if the worst 10% of the students get extra lessons and are no longer the worst 10% in the next year, this does not prove that the extra lessons are useful.


More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Important note: Transcripts are not substitutes for textbook assignments. 1

Important note: Transcripts are not substitutes for textbook assignments. 1 In this lesson we will cover correlation and regression, two really common statistical analyses for quantitative (or continuous) data. Specially we will review how to organize the data, the importance

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

Statistics for EES 3. From standard error to t-test

Statistics for EES 3. From standard error to t-test Statistics for EES 3. From standard error to t-test Dirk Metzler http://evol.bio.lmu.de/_statgen May 23, 2011 Contents 1 The Normal Distribution 2 Taking standard errors into account 3 The t-test for paired

More information