Statistics for EES 7. Linear regression and linear models


1 Statistics for EES 7. Linear regression and linear models Dirk Metzler May 2009

2 Contents 1 Univariate linear regression: how and why? 2 t-test for linear regression 3 log-scaling the data


4 Univariate linear regression: how and why? Griffon Vulture Gyps fulvus, German: Gänsegeier. Photo (c) by Jörg Hempel

5 Univariate linear regression: how and why?
Prinzinger, R., E. Karl, R. Bögel, Ch. Walzer (1999): Energy metabolism, body temperature, and cardiac work in the Griffon vulture Gyps fulvus - telemetric investigations in the laboratory and in the field. Zoology 102, Suppl. II: 15
Data from Goethe-University, Group of Prof. Prinzinger
Developed a telemetric system for measuring heart beats of flying birds
Important for ecological questions: metabolic rate.
Metabolic rate can only be measured in the lab.
Can we infer metabolic rate from heart beat frequency?

9 Univariate linear regression: how and why?
[Scatter plot: metabolic rate [J/(g*h)] against heart beats [per minute], griffon vulture at 16 degrees C]

11 Univariate linear regression: how and why?
[Table: vulture data with columns day, heartbpm, metabol, mintemp, maxtemp, medtemp; measurements from 14 different days]

12 Univariate linear regression: how and why?
> model <- lm(metabol ~ heartbpm, data = vulture, subset = day == "17.05.")
> summary(model)

Call:
lm(formula = metabol ~ heartbpm, data = vulture, subset = day == "17.05.")

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               e-08 ***
heartbpm                                  e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 17 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic:  on 1 and 17 DF, p-value: 2.979e-14
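A minimal R sketch of how such a fit can be visualized (assuming the vulture data frame shown above is loaded, with the variable names used in the output):

# fit the model for one day and draw the data together with the regression line
d <- subset(vulture, day == "17.05.")
plot(d$heartbpm, d$metabol,
     xlab = "heart beats [per minute]",
     ylab = "metabolic rate [J/(g*h)]")
model <- lm(metabol ~ heartbpm, data = d)
abline(model)    # adds the fitted line y = a.hat + b.hat * x to the plot
coef(model)      # shows the estimated intercept and slope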


15 Univariate linear regression: how and why?
[Figure: three points (x_1, y_1), (x_2, y_2), (x_3, y_3) and a straight line y = a + b·x; a is the intercept and the slope is b = (y_2 − y_1)/(x_2 − x_1)]

20 Univariate linear regression: how and why?
[Figure: scatter plot with a candidate line; the vertical distances r_1, r_2, ..., r_n between the points and the line are the residuals]
residuals: r_i = y_i − (a + b·x_i)
the line must minimize the sum of squared residuals r_1² + r_2² + ... + r_n²

25 Univariate linear regression: how and why?
Define the regression line y = â + b̂·x by minimizing the sum of squared residuals:
(â, b̂) = argmin_(a,b) Σ_i ( y_i − (a + b·x_i) )²
This is based on the model assumption that values a, b exist such that, for all data points (x_i, y_i), we have
y_i = a + b·x_i + ε_i,
where all ε_i are independent and normally distributed with the same variance σ².
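To make the arg-min definition concrete, here is a small R sketch (with simulated data, not the vulture measurements) that minimizes the sum of squared residuals numerically and compares the result with lm():

set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)                        # simulated data with a = 3, b = 0.5
rss <- function(p) sum((y - (p[1] + p[2] * x))^2)   # sum of squared residuals
optim(c(0, 0), rss)$par                             # numerical minimizer (a.hat, b.hat)
coef(lm(y ~ x))                                     # lm() gives (essentially) the same values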

26 Univariate linear regression: how and why?
Given data:
Y    X
y_1  x_1
y_2  x_2
y_3  x_3
...  ...
y_n  x_n

Model: there are values a, b, σ² such that
y_1 = a + b·x_1 + ε_1
y_2 = a + b·x_2 + ε_2
y_3 = a + b·x_3 + ε_3
...
y_n = a + b·x_n + ε_n

ε_1, ε_2, ..., ε_n are independent N(0, σ²).
Hence y_1, y_2, ..., y_n are independent, with y_i ~ N(a + b·x_i, σ²).
a, b, σ² are unknown, but not random.

31 Univariate linear regression: how and why?
We estimate a and b by computing
(â, b̂) := argmin_(a,b) Σ_i ( y_i − (a + b·x_i) )².

Theorem. â and b̂ are given by
b̂ = Σ_i (y_i − ȳ)(x_i − x̄) / Σ_i (x_i − x̄)² = Σ_i y_i (x_i − x̄) / Σ_i (x_i − x̄)²
and
â = ȳ − b̂·x̄.

Please keep in mind: the line y = â + b̂·x goes through the center of gravity of the cloud of points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
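The formulas of the theorem can be checked directly in R; a sketch with the same kind of simulated data as above (x and y are generated again so the lines are self-contained):

set.seed(1); x <- 1:20; y <- 3 + 0.5 * x + rnorm(20)
b.hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
a.hat <- mean(y) - b.hat * mean(x)
c(a.hat, b.hat)                        # agrees with coef(lm(y ~ x))
a.hat + b.hat * mean(x) - mean(y)      # 0: the line passes through (x.bar, y.bar)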

34 Univariate linear regression: how and why?
Sketch of the proof of the theorem: let g(a, b) = Σ_i ( y_i − (a + b·x_i) )². We optimize g by setting the derivatives of g to 0:
∂g(a, b)/∂a = Σ_i 2·( y_i − (a + b·x_i) )·(−1)
∂g(a, b)/∂b = Σ_i 2·( y_i − (a + b·x_i) )·(−x_i)
and obtain
0 = Σ_i ( y_i − (â + b̂·x_i) )·(−1)
0 = Σ_i ( y_i − (â + b̂·x_i) )·(−x_i)

35 Univariate linear regression: how and why?
0 = Σ_i ( y_i − (â + b̂·x_i) )
0 = Σ_i ( y_i − (â + b̂·x_i) )·x_i
gives us
0 = ( Σ_i y_i ) − n·â − b̂·( Σ_i x_i )
0 = ( Σ_i y_i·x_i ) − â·( Σ_i x_i ) − b̂·( Σ_i x_i² )
and the theorem follows by solving this for â and b̂.
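The two equations above form a linear 2x2 system in (a, b); as an illustration (again with simulated data), it can be solved directly and gives the same coefficients as lm():

set.seed(1); x <- 1:20; y <- 3 + 0.5 * x + rnorm(20)
n <- length(x)
A   <- rbind(c(n,      sum(x)),
             c(sum(x), sum(x^2)))    # coefficient matrix of the normal equations
rhs <- c(sum(y), sum(x * y))
solve(A, rhs)                        # (a.hat, b.hat), same values as coef(lm(y ~ x))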



38 Univariate linear regression: how and why? Optimizing clutch size
Example: Cowpea weevil (also bruchid beetle), Callosobruchus maculatus, German: Erbsensamenkäfer
Wilson, K. (1994) Evolution of clutch size in insects. II. A test of static optimality models using the beetle Callosobruchus maculatus (Coleoptera: Bruchidae). Journal of Evolutionary Biology 7
How does survival probability depend on clutch size?
Which clutch size optimizes the expected number of surviving offspring?

40 Univariate linear regression: how and why?
[Figure: viability plotted against clutchsize]

42 Univariate linear regression: how and why?
[Figure: clutchsize * viability, the expected number of surviving offspring, plotted against clutchsize]
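As a hedged sketch of how the second question could be answered in R: assume a data frame clutch with columns clutchsize and viability (these names are hypothetical; the original data are not reproduced here). If viability declines roughly linearly with clutch size, the expected number of surviving offspring clutchsize * viability is a downward parabola:

fit <- lm(viability ~ clutchsize, data = clutch)   # viability = a + b * clutchsize
a <- coef(fit)[1]
b <- coef(fit)[2]                                  # b is expected to be negative
# expected surviving offspring: x * (a + b*x), maximal at x = -a / (2*b) when b < 0
optimal.clutchsize <- -a / (2 * b)
optimal.clutchsize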

44 Contents t-test for linear regression 1 Univariate linear regression: how and why? 2 t-test for linear regression 3 log-scaling the data

45 t-test for linear regression
Example: red deer (Cervus elaphus)
Theory: females can influence the sex of their offspring.
Evolutionarily stable strategy: weak animals may tend to have female offspring, strong animals may tend to have male offspring.
Clutton-Brock, T. H., Albon, S. D., Guinness, F. E. (1986) Great expectations: dominance, breeding success and offspring sex ratios in red deer. Anim. Behav. 34

47 t-test for linear regression
> hind
[Table: hind data with columns rank and ratiomales]
CAUTION: Simulated data, inspired by original paper

48 t-test for linear regression
[Scatter plot: hind$ratiomales against hind$rank]

50 t-test for linear regression
> mod <- lm(ratiomales ~ rank, data = hind)
> summary(mod)

Call:
lm(formula = ratiomales ~ rank, data = hind)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               e-06 ***
rank                                      e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 52 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

51 t-test for linear regression
Model: Y = a + b·X + ε with ε ~ N(0, σ²)
How to compute the significance of a relationship between the explanatory trait X and the target variable Y?
In other words: how can we test the null hypothesis b = 0?
We have estimated b by b̂ ≠ 0. Could the true b be 0?
How large is the standard error of b̂?

56 t-test for linear regression
y_i = a + b·x_i + ε_i with ε_i ~ N(0, σ²)
not random: a, b, x_i, σ²      random: ε_i, y_i
var(y_i) = var(a + b·x_i + ε_i) = var(ε_i) = σ², and y_1, y_2, ..., y_n are stochastically independent.

b̂ = Σ_i y_i (x_i − x̄) / Σ_i (x_i − x̄)²

var(b̂) = var( Σ_i y_i (x_i − x̄) / Σ_i (x_i − x̄)² )
       = var( Σ_i y_i (x_i − x̄) ) / ( Σ_i (x_i − x̄)² )²
       = Σ_i var(y_i)·(x_i − x̄)² / ( Σ_i (x_i − x̄)² )²
       = σ²·Σ_i (x_i − x̄)² / ( Σ_i (x_i − x̄)² )²
       = σ² / Σ_i (x_i − x̄)²
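The formula var(b̂) = σ² / Σ_i (x_i − x̄)² can be checked by simulation; a small R sketch (arbitrary true values for a, b and σ):

set.seed(1)
x <- 1:20; a <- 3; b <- 0.5; sigma <- 2
b.hats <- replicate(10000, {
  y <- a + b * x + rnorm(length(x), sd = sigma)
  sum(y * (x - mean(x))) / sum((x - mean(x))^2)   # the estimator b.hat
})
var(b.hats)                         # empirical variance over many simulated data sets
sigma^2 / sum((x - mean(x))^2)      # theoretical value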

61 t-test for linear regression
In fact, b̂ is normally distributed with mean b and
var(b̂) = σ² / Σ_i (x_i − x̄)².
Problem: we do not know σ².
We estimate σ² by considering the residual variance:
s² := Σ_i ( y_i − â − b̂·x_i )² / (n − 2)
Note that we divide by n − 2. The reason for this is that the two model parameters a and b have been estimated, which means that two degrees of freedom got lost.

65 t-test for linear regression
var(b̂) = σ² / Σ_i (x_i − x̄)²
Estimate σ² by
s² = Σ_i ( y_i − â − b̂·x_i )² / (n − 2).
Then
( b̂ − b ) / ( s / sqrt( Σ_i (x_i − x̄)² ) )
is Student-t-distributed with n − 2 degrees of freedom, and we can apply the t-test to test the null hypothesis b = 0.
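A sketch of the whole calculation in R (simulated data again, so the numbers themselves are arbitrary); the hand-computed t value and p-value match the slope row of summary(lm(...)):

set.seed(1); x <- 1:20; y <- 3 + 0.5 * x + rnorm(20)
fit   <- lm(y ~ x)
a.hat <- coef(fit)[1]; b.hat <- coef(fit)[2]
s2    <- sum((y - a.hat - b.hat * x)^2) / (length(x) - 2)   # residual variance s^2
se.b  <- sqrt(s2 / sum((x - mean(x))^2))                    # standard error of b.hat
t.val <- b.hat / se.b
p.val <- 2 * pt(-abs(t.val), df = length(x) - 2)
c(t.val, p.val)
summary(fit)$coefficients["x", ]    # same t value and Pr(>|t|)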

66 Contents log-scaling the data 1 Univariate linear regression: how and why? 2 t-test for linear regression 3 log-scaling the data

67 log-scaling the data
Data example: typical body weight [kg] and brain weight [g] of 62 mammal species (and 3 dinosaurs)
> data
[Table: columns weight.kg., brain.weight.g, species, extinct; rows include african elephant, asian elephant, cat, chimpanzee, ...]

68 log-scaling the data
[Scatter plot: brain weight [g] against body weight [kg], typical values for 62 mammal species]

69 log-scaling the data
[The same data on log-log axes for 65 species, now including Triceratops, Diplodocus and Brachiosaurus, with species such as mouse, human, horse and african elephant labelled]

70 log-scaling the data
[Log-log scatter plot: brain weight [g] against body weight [kg]]

71 log-scaling the data
> modell <- lm(brain.weight.g ~ weight.kg., subset = extinct == "no")
> summary(modell)

Call:
lm(formula = brain.weight.g ~ weight.kg., subset = extinct == "no")

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      *
weight.kg.                                <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 60 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic:  on 1 and 60 DF, p-value: < 2.2e-16

72 log-scaling the data
qqnorm(modell$residuals)
[Normal Q-Q plot of the residuals: sample quantiles against theoretical quantiles]

73 log-scaling the data
plot(modell$fitted.values, modell$residuals)
[Scatter plot: residuals against fitted values]

74 log-scaling the data
plot(modell$fitted.values, modell$residuals, log = "x")
[Scatter plot: residuals against fitted values, x-axis on log scale]

75 log-scaling the data
plot(modell$model$weight.kg., modell$residuals)
[Scatter plot: residuals against body weight]

76 log-scaling the data
plot(modell$model$weight.kg., modell$residuals, log = "x")
[Scatter plot: residuals against body weight, x-axis on log scale]
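Instead of producing these residual plots one by one, R's built-in diagnostics for lm objects can be used; a minimal sketch, assuming the fitted model object modell from above:

par(mfrow = c(2, 2))   # 2 x 2 grid of diagnostic plots
plot(modell)           # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))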

77 log-scaling the data
We see that the residual variance depends on the fitted values (or on the body weight): heteroscedasticity.
The model assumes homoscedasticity, i.e. the random deviations must be (almost) independent of the explaining traits (body weight) and of the fitted values.
Variance-stabilizing transformation: we can rescale body and brain size to make the deviations independent of the variables.

80 log-scaling the data
Actually this is not so surprising: an elephant's brain of typically 5 kg can easily be 500 g lighter or heavier from individual to individual. This cannot happen for a mouse brain of typically 5 g. The latter will rather also vary by 10%, i.e. 0.5 g. Thus, the variance is not additive but rather multiplicative:
brain mass = (expected brain mass) · (random factor)
We can convert this into something with additive randomness by taking the log:
log(brain mass) = log(expected brain mass) + log(random factor)
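A small simulation sketch (not the real data set) showing the same effect: with multiplicative lognormal noise the residuals fan out on the original scale, but look homoscedastic after taking logs:

set.seed(1)
body  <- exp(runif(100, -2, 8))                      # body weights over several orders of magnitude
brain <- 10 * body^0.75 * exp(rnorm(100, sd = 0.3))  # multiplicative noise
fit.raw <- lm(brain ~ body)
fit.log <- lm(log(brain) ~ log(body))
plot(fit.raw$fitted.values, fit.raw$residuals)   # spread grows with the fitted values
plot(fit.log$fitted.values, fit.log$residuals)   # roughly constant spread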

81 log-scaling the data
> logmodell <- lm(log(brain.weight.g) ~ log(weight.kg.), subset = extinct == "no")
> summary(logmodell)

Call:
lm(formula = log(brain.weight.g) ~ log(weight.kg.), subset = extinct == "no")

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  <2e-16 ***
log(weight.kg.)                              <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:  on 60 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

82 log-scaling the data
qqnorm(modell$residuals)
[Normal Q-Q plot of the residuals]

83 log-scaling the data
plot(logmodell$fitted.values, logmodell$residuals)
[Scatter plot: residuals of the log-log model against its fitted values]

84 log-scaling the data
plot(logmodell$fitted.values, logmodell$residuals, log = "x")
[Scatter plot: residuals against fitted values, x-axis on log scale]

85 log-scaling the data
plot(weight.kg.[extinct == "no"], logmodell$residuals)
[Scatter plot: residuals against body weight]

86 log-scaling the data
plot(weight.kg.[extinct == "no"], logmodell$residuals, log = "x")
[Scatter plot: residuals against body weight, x-axis on log scale]
