Statistics for EES Linear regression and linear models


Dirk Metzler
June 11, 2018

Contents
1 Univariate linear regression: how and why?
2 t-test for linear regression
3 log-scaling the data
4 Checking model assumptions
5 Linear regression example with scaling
6 Why it is called regression

1 Univariate linear regression: how and why?

References
[1] Prinzinger, R., E. Karl, R. Bögel, Ch. Walzer (1999): Energy metabolism, body temperature, and cardiac work in the Griffon vulture Gyps vulvus - telemetric investigations in the laboratory and in the field. Zoology 102, Suppl. II: 15

Data from Goethe-University, Group of Prof. Prinzinger
Developed telemetric system for measuring heart beats of flying birds
Important for ecological questions: metabolic rate
Metabolic rate can only be measured in the lab
Can we infer metabolic rate from heart beat frequency?

[Figure: scatter plot "griffon vulture, [...], 16 degrees C"; x-axis: heart beats [per minute], y-axis: metabolic rate [J/(g*h)]]

> vulture
   day heartbpm metabol mintemp maxtemp medtemp
   [...]
(14 different days)

> model <- lm(metabol~heartbpm,data=vulture, subset=day=="17.05.")
> summary(model)

Call:
lm(formula = metabol ~ heartbpm, data = vulture, subset = day == "17.05.")

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

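Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    [...]      [...]   [...] [...]e-08 ***
heartbpm       [...]      [...]   [...] [...]e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: [...] on 17 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: [...] on 1 and 17 DF, p-value: 2.979e-14

[Figure: straight line $y = a + b\,x$ with intercept $a$ and slope $b = \frac{y_2 - y_1}{x_2 - x_1}$]

[Figure: data points, a straight line and the residuals $r_1, r_2, \dots, r_n$]

Residuals: $r_i = y_i - (a + b\,x_i)$

The line must minimize the sum of squared residuals $r_1^2 + r_2^2 + \dots + r_n^2$.

Define the regression line $y = \hat a + \hat b\,x$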

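by minimizing the sum of squared residuals:
\[ (\hat a, \hat b) = \arg\min_{(a,b)} \sum_i \bigl(y_i - (a + b\,x_i)\bigr)^2 \]
This is based on the model assumption that values $a$, $b$ exist such that, for all data points $(x_i, y_i)$, we have
\[ y_i = a + b\,x_i + \varepsilon_i, \]
where all $\varepsilon_i$ are independent and normally distributed with the same variance $\sigma^2$.

Given data and model: there are values $a$, $b$, $\sigma^2$ such that

  Y      X      Model
  y_1    x_1    $y_1 = a + b\,x_1 + \varepsilon_1$
  y_2    x_2    $y_2 = a + b\,x_2 + \varepsilon_2$
  y_3    x_3    $y_3 = a + b\,x_3 + \varepsilon_3$
  ...    ...    ...
  y_n    x_n    $y_n = a + b\,x_n + \varepsilon_n$

$\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are independent $\sim \mathcal N(0, \sigma^2)$.
Hence $y_1, y_2, \dots, y_n$ are independent with $y_i \sim \mathcal N(a + b\,x_i, \sigma^2)$.
$a$, $b$, $\sigma^2$ are unknown, but not random.

We estimate $a$ and $b$ by computing
\[ (\hat a, \hat b) := \arg\min_{(a,b)} \sum_i \bigl(y_i - (a + b\,x_i)\bigr)^2 . \]

Theorem 1. Compute $\hat a$ and $\hat b$ by
\[ \hat b = \frac{\sum_i (y_i - \bar y)(x_i - \bar x)}{\sum_i (x_i - \bar x)^2} = \frac{\sum_i y_i\,(x_i - \bar x)}{\sum_i (x_i - \bar x)^2} \]
and
\[ \hat a = \bar y - \hat b\,\bar x . \]

Please keep in mind: The line $y = \hat a + \hat b\,x$ goes through the center of gravity of the cloud of points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.

Sketch of the proof of the theorem
Let $g(a, b) = \sum_i \bigl(y_i - (a + b\,x_i)\bigr)^2$. We optimize $g$ by setting the derivatives of $g$
\[ \frac{\partial g(a,b)}{\partial a} = \sum_i 2\,\bigl(y_i - (a + b\,x_i)\bigr)\cdot(-1) \]
\[ \frac{\partial g(a,b)}{\partial b} = \sum_i 2\,\bigl(y_i - (a + b\,x_i)\bigr)\cdot(-x_i) \]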

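to 0 and obtain
\[ 0 = \sum_i \bigl(y_i - (\hat a + \hat b\,x_i)\bigr)\cdot(-1) \]
\[ 0 = \sum_i \bigl(y_i - (\hat a + \hat b\,x_i)\bigr)\cdot(-x_i) \]
that is,
\[ 0 = \sum_i \bigl(y_i - (\hat a + \hat b\,x_i)\bigr) \]
\[ 0 = \sum_i \bigl(y_i - (\hat a + \hat b\,x_i)\bigr)\cdot x_i \]
which gives us
\[ 0 = \Bigl(\sum_i y_i\Bigr) - n\,\hat a - \hat b\,\Bigl(\sum_i x_i\Bigr) \]
\[ 0 = \Bigl(\sum_i y_i x_i\Bigr) - \hat a\,\Bigl(\sum_i x_i\Bigr) - \hat b\,\Bigl(\sum_i x_i^2\Bigr) \]
and the theorem follows by solving this for $\hat a$ and $\hat b$.

Regression and Correlation
If $s_x$ and $s_y$ are the bias-corrected (that is, computed with $n - 1$) standard deviations of the $x$ and $y$ values, and if
\[ \mathrm{cov}(x, y) = \frac{1}{n-1} \sum_i (x_i - \bar x)(y_i - \bar y) \]
is the bias-corrected covariance, we obtain for the estimated slope of the regression line:
\[ \hat b = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{\frac{1}{n-1}\sum_i (x_i - \bar x)(y_i - \bar y)}{\frac{1}{n-1}\sum_i (x_i - \bar x)^2} = \frac{\mathrm{cov}(x, y)}{s_x^2} . \]
Thus, $\hat b$ is equal to the correlation $\mathrm{cor}(x, y) = \frac{\mathrm{cov}(x,y)}{s_x\,s_y}$ if and only if $s_x = s_y$ (or, trivially, if the covariance is 0).

Optimizing the clutch size
Example: Cowpea weevil (also bruchid beetle) Callosobruchus maculatus
German: Erbsensamenkäfer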

References
[Wil94] Wilson, K. (1994) Evolution of clutch size in insects. II. A test of static optimality models using the beetle Callosobruchus maculatus (Coleoptera: Bruchidae). Journal of Evolutionary Biology 7: [...]

How does survival probability depend on clutch size?
Which clutch size optimizes the expected number of surviving offspring?

[Figure: viability against clutchsize]
[Figure: clutchsize * viability against clutchsize]

2 t-test for linear regression

Example: red deer (Cervus elaphus)
Theory: females can influence the sex of their offspring

Evolutionarily stable strategy: weak animals may tend to have female offspring, strong animals may tend to have male offspring.

References
[CAG86] Clutton-Brock, T. H., Albon, S. D., Guinness, F. E. (1986) Great expectations: dominance, breeding success and offspring sex ratios in red deer. Anim. Behav. 34, [...]

> hind
   rank ratiomales
   [...]

CAUTION: Simulated data, inspired by original paper

[Figure: scatter plot of hind$ratiomales against hind$rank]

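> mod <- lm(ratiomales~rank,data=hind)
> summary(mod)

Call:
lm(formula = ratiomales ~ rank, data = hind)

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    [...]      [...]   [...] [...]e-06 ***
rank           [...]      [...]   [...] [...]e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: [...] on 52 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: [...] on 1 and 52 DF, p-value: 9.78e-09

Model: $Y = a + b\,X + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2)$

How to compute the significance of a relationship between the explanatory trait $X$ and the target variable $Y$?
In other words: How can we test the null hypothesis $b = 0$?
We have estimated $b$ by $\hat b \neq 0$. Could the true $b$ be 0?
How large is the standard error of $\hat b$?

$y_i = a + b\,x_i + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2)$
not random: $a$, $b$, $x_i$, $\sigma^2$
random: $\varepsilon$, $y_i$
\[ \mathrm{var}(y_i) = \mathrm{var}(a + b\,x_i + \varepsilon) = \mathrm{var}(\varepsilon) = \sigma^2 \]
and $y_1, y_2, \dots, y_n$ are stochastically independent.

\[ \hat b = \frac{\sum_i y_i\,(x_i - \bar x)}{\sum_i (x_i - \bar x)^2} \]
\[ \mathrm{var}(\hat b) = \mathrm{var}\!\left(\frac{\sum_i y_i\,(x_i - \bar x)}{\sum_i (x_i - \bar x)^2}\right) = \frac{\mathrm{var}\bigl(\sum_i y_i\,(x_i - \bar x)\bigr)}{\bigl(\sum_i (x_i - \bar x)^2\bigr)^2} = \frac{\sum_i \mathrm{var}(y_i)\,(x_i - \bar x)^2}{\bigl(\sum_i (x_i - \bar x)^2\bigr)^2} = \sigma^2\,\frac{\sum_i (x_i - \bar x)^2}{\bigl(\sum_i (x_i - \bar x)^2\bigr)^2} = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2} \]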
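The variance formula can be checked numerically. The following R sketch is only an illustration: the parameter values a = 2, b = 0.5, sigma = 1, the sample size and the x values are arbitrary assumptions, not data from the lecture. It computes the slope with the closed-form expression for the estimated slope, compares it with lm(), and then compares the empirical variance of the slope estimate over many simulated data sets with sigma^2 divided by the sum of squared deviations of x.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
a <- 2; b <- 0.5; sigma <- 1       # assumed "true" parameters (hypothetical)
x <- seq(1, 20, length.out = 30)   # fixed, non-random x values
sxx <- sum((x - mean(x))^2)

# closed-form slope estimate: sum(y_i (x_i - xbar)) / sum((x_i - xbar)^2)
slope_hat <- function(y) sum(y * (x - mean(x))) / sxx

# one simulated data set: the closed form agrees with lm()
y <- a + b * x + rnorm(length(x), mean = 0, sd = sigma)
b_formula <- slope_hat(y)
b_lm <- unname(coef(lm(y ~ x))["x"])

# many simulated data sets: empirical variance of the slope estimates
b_hats <- replicate(10000, slope_hat(a + b * x + rnorm(length(x), sd = sigma)))
var_empirical <- var(b_hats)
var_theory <- sigma^2 / sxx

print(c(formula = b_formula, lm = b_lm))
print(c(empirical = var_empirical, theory = var_theory))
```

The x values are kept fixed across all simulated data sets because the model treats them as non-random; only the error terms are redrawn.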

In fact $\hat b$ is normally distributed with mean $b$ and
\[ \mathrm{var}(\hat b) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2} . \]
Problem: We do not know $\sigma^2$.
We estimate $\sigma^2$ by considering the residual variance:
\[ s^2 := \frac{\sum_i \bigl(y_i - \hat a - \hat b\,x_i\bigr)^2}{n - 2} \]
Note that we divide by $n - 2$. The reason for this is that two model parameters, $a$ and $b$, have been estimated, which means that two degrees of freedom are lost.

With $\sigma^2$ estimated by $s^2$,
\[ \frac{\hat b - b}{s \Big/ \sqrt{\sum_i (x_i - \bar x)^2}} \]
is Student-t-distributed with $n - 2$ degrees of freedom, and we can apply the t-test to test the null hypothesis $b = 0$.

3 log-scaling the data

Data example: typical body weight [kg] and brain weight [g] of 62 mammal species (and 3 dinosaurs)

> data
   weight.kg. brain.weight.g          species extinct
        [...]          [...] african elephant      no
        [...]          [...]            [...]      no
        [...]          [...]            [...]      no
        [...]          [...]            [...]      no
        [...]          [...]   asian elephant      no
        [...]          [...]            [...]      no
        [...]          [...]            [...]      no
        [...]          [...]            [...]      no
        [...]          [...]              cat      no

        [...]          [...]       chimpanzee      no
        [...]          [...]      Triceratops     yes
        [...]          [...]    Brachiosaurus     yes

[Figure: "typical values for 65 vertebrate species"; log-log scatter plot, x-axis: body weight [kg], y-axis: brain weight [g]; labeled points include asian elephant, african elephant, mouse, human, giraffe, horse, chimpanzee, cow, donkey, potar monkey, grey wolf, goat, kangaroo, cat, rabbit, mountain beaver, guinea pig, mole rat, hamster, rhesus monkey, sheep, jaguar, pig, Brachiosaurus, Triceratops, Diplodocus]

> modell <- lm(brain.weight.g~weight.kg.,subset=extinct=="no")
> summary(modell)

Call:
lm(formula = brain.weight.g ~ weight.kg., subset = extinct == "no")

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    [...]      [...]   [...]    [...] *
weight.kg.     [...]      [...]   [...]   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: [...] on 60 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: [...] on 1 and 60 DF, p-value: < 2.2e-16

qqnorm(modell$residuals)

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

plot(modell$fitted.values,modell$residuals)
plot(modell$fitted.values,modell$residuals,log="x")

[Figures: modell$residuals against modell$fitted.values, with linear and with log-scaled x-axis]

plot(modell$model$weight.kg.,modell$residuals)
plot(modell$model$weight.kg.,modell$residuals,log="x")

[Figures: modell$residuals against modell$model$weight.kg., with linear and with log-scaled x-axis]

We see that the variance of the residuals depends on the fitted values (or the body weight): heteroscedasticity.
The model assumes homoscedasticity, i.e. the random deviations must be (almost) independent of the explaining traits (body weight) and the fitted values.
Variance-stabilizing transformation: can we rescale body and brain size to make the deviations independent of the variables?

Actually not so surprising: An elephant's brain of typically 5 kg can easily be 500 g lighter or heavier from individual to individual. This cannot happen for a mouse brain of typically 5 g. The latter will rather also vary by 10%, i.e. 0.5 g. Thus, the randomness is not additive but rather multiplicative:

brain mass = (expected brain mass) · random

We can convert this into something with additive randomness by taking the log:

log(brain mass) = log(expected brain mass) + log(random)

> logmodell <- lm(log(brain.weight.g)~log(weight.kg.),subset=extinct=="no")
> summary(logmodell)

Call:
lm(formula = log(brain.weight.g) ~ log(weight.kg.), subset = extinct == "no")

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)        [...]      [...]   [...]   <2e-16 ***
log(weight.kg.)    [...]      [...]   [...]   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: [...] on 60 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: [...] on 1 and 60 DF, p-value: < 2.2e-16

qqnorm(logmodell$residuals)

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

plot(logmodell$fitted.values,logmodell$residuals)
plot(logmodell$fitted.values,logmodell$residuals,log="x")

[Figures: logmodell$residuals against logmodell$fitted.values, with linear and with log-scaled x-axis]

plot(weight.kg.[extinct=="no"],logmodell$residuals)
plot(weight.kg.[extinct=="no"],logmodell$residuals,log="x")

[Figures: logmodell$residuals against weight.kg.[extinct == "no"], with linear and with log-scaled x-axis]

4 Checking model assumptions

Is the model appropriate for the data? E.g. $Y = a + b\,X + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2)$

If the model fits, the residuals
\[ y_i - (\hat a + \hat b\,x_i) \]
must look normally distributed and must not show obvious dependencies on $X$ or on $\hat a + \hat b\,X$.

Example: is the relation between $X$ and $Y$ sufficiently well described by the linear equation $Y_i = a + b\,X_i + \varepsilon_i$?

[Figure: scatter plot of Y against X]

> mod <- lm(Y ~ X)
> summary(mod)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    [...]      [...]   [...]    [...]
X              [...]      [...]   [...]   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: [...] on 28 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: [...] on 1 and 28 DF, p-value: < 2.2e-16

> plot(X,residuals(mod))

[Figure: residuals(mod) against X]

Obviously, the residuals tend to be larger for very large and very small values of X than for mean values of X. That should not be!

Idea: fit a section of a parabola instead of a line to $(x_i, y_i)$, i.e. a model of the form
\[ Y_i = a + b\,X_i + c\,X_i^2 + \varepsilon_i . \]
Is this still a linear model? Yes: Let $Z_i = X_i^2$, then $Y$ is linear in $X$ and $Z$.

In R:

> Z <- X^2
> mod2 <- lm(Y ~ X+Z)
> summary(mod2)

Call:
lm(formula = Y ~ X + Z)

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    [...]      [...]   [...]   <2e-16 ***
X              [...]      [...]   [...]    [...] *
Z              [...]      [...]   [...]   <2e-16 ***

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: [...] on 27 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: 7776 on 2 and 27 DF, p-value: < 2.2e-16

For this model there is no obvious dependence between X and the residuals:

plot(X,residuals(mod2))

[Figure: residuals(mod2) against X]

Is the assumption of normality in the model $Y_i = a + b\,X_i + \varepsilon_i$ in accordance with the data?
Are the residuals $r_i = Y_i - (\hat a + \hat b\,X_i)$ more or less normally distributed?

Graphical method: compare the theoretical quantiles of the standard normal distribution $\mathcal N(0, 1)$ with those of the residuals.

Background: If we plot the quantiles of $\mathcal N(\mu, \sigma^2)$ against those of $\mathcal N(0, 1)$, we obtain a line $y(x) = \mu + \sigma\,x$. (Reason: If $X$ is standard-normally distributed and $Y = a + b\,X$, then $Y$ is normally distributed with mean $a$ and variance $b^2$.)

Wrong: "Before we fit the model with lm() we first have to check whether the model assumptions are fulfilled."
Right: To check the assumptions underlying a linear model we need the residuals. To compute the residuals we first have to fit the model (in R with lm()). After that we can check the model assumptions and decide whether we stay with this model or still have to modify it.

p <- seq(from=0,to=1,by=0.01)
plot(qnorm(p,mean=0,sd=1),qnorm(p,mean=1,sd=0.5),
     pch=16,cex=0.5)
abline(v=0,h=0)

[Figure: qnorm(p, mean = 1, sd = 0.5) against qnorm(p, mean = 0, sd = 1)]

If we plot the empirical quantiles of a sample from a normal distribution against the theoretical quantiles of a standard normal distribution, the values are not precisely on the line but are scattered around a line.

If no systematic deviations from an imaginary line are recognizable: the normal distribution assumption is acceptable.
If systematic deviations from an imaginary line are obvious: the assumption of normality may be problematic. It may be necessary to rescale variables or to take additional explanatory variables into account.

5 Linear regression example with scaling

Data: For 301 US-American counties, the number of white female inhabitants in a certain age group in 1960 and the number of deaths by breast cancer in this group between 1950 and [...]. (Rice (2007) Mathematical Statistics and Data Analysis.)

> canc
   deaths inhabitants
   [...]

Is the average number of deaths proportional to population size, i.e.
\[ \mathbb E\,\text{deaths} = b \cdot \text{inhabitants}, \]
or does the cancer risk depend on the size of the county, such that a different model fits better? E.g.
\[ \mathbb E\,\text{deaths} = a + b \cdot \text{inhabitants} \quad \text{with } a \neq 0 . \]

> modell <- lm(deaths~inhabitants,data=canc)
> summary(modell)

Call:
lm(formula = deaths ~ inhabitants, data = canc)

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)
(Intercept) [...]e[...] [...]e[...]   [...]    [...]
inhabitants 3.578e[...] [...]e[...]   [...]   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13[...] on 299 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: 4315 on 1 and 299 DF, p-value: < 2.2e-16

The intercept is estimated to be [...], but it is not significantly different from 0. Thus we cannot reject the null hypothesis that the county size has no influence on the cancer risk.

But... does the model fit?

qqnorm(modell$residuals)

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

plot(modell$fitted.values,modell$residuals)
plot(modell$fitted.values,modell$residuals,log="x")

[Figures: modell$residuals against modell$fitted.values, with linear and with log-scaled x-axis]

plot(canc$inhabitants,modell$residuals,log="x")

[Figure: modell$residuals against canc$inhabitants, log-scaled x-axis]

The variance of the residuals depends on the fitted values: heteroscedasticity.
The linear model assumes homoscedasticity.

Variance-stabilizing transformation: How can we rescale the population size such that we obtain homoscedastic data?

Where does the variance come from? If $n$ is the number of white female inhabitants and $p$ the individual probability of dying of breast cancer within 10 years, then $n\,p$ is the expected number of deaths and the variance is $n\,p\,(1 - p) \approx n\,p$ (maybe approximate the binomial by a Poisson distribution). Standard deviation: $\sqrt{n\,p}$.

In this case we can approximately stabilize the variance by taking the square root on both sides of the equation.

Explanation:
\[ \sqrt{y} = b\,\sqrt{x} + \varepsilon \quad\Rightarrow\quad y = \bigl(b\,\sqrt{x} + \varepsilon\bigr)^2 = b^2\,x + 2\,b\,\sqrt{x}\,\varepsilon + \varepsilon^2 \]
The SD of $y$ is not exactly proportional to $\sqrt{x}$, but at least $2\,b\,\sqrt{x}\,\varepsilon$ has an SD proportional to $\sqrt{x}$, namely $2\,b\,\sqrt{x}\,\sigma$. The term $\varepsilon^2$ is the $\sigma^2$-fold of a $\chi^2_1$-distributed random variable and has SD $= \sigma^2\,\sqrt{2}$. If $\sigma$ is small compared to $b\,\sqrt{x}$, the approximation
\[ y \approx b^2\,x + 2\,b\,\sqrt{x}\,\varepsilon \]
is reasonable and the SD of $y$ is approximately proportional to $\sqrt{x}$.
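A small simulation can illustrate the effect of the square-root transformation. The sketch below is not based on the county data: the death probability p = 0.004 and the two population sizes are made-up assumptions. It draws Poisson counts with mean n p for a small and a large county and compares the standard deviations of the raw counts with those of their square roots.

```r
set.seed(1)                    # arbitrary seed, for reproducibility only
p <- 0.004                     # hypothetical individual death probability
n_small <- 5000                # hypothetical small county
n_large <- 50000               # hypothetical large county

deaths_small <- rpois(20000, lambda = n_small * p)   # expected number of deaths: 20
deaths_large <- rpois(20000, lambda = n_large * p)   # expected number of deaths: 200

# the SD of the raw counts grows like sqrt(n * p) ...
sd_raw <- c(small = sd(deaths_small), large = sd(deaths_large))
# ... while the SD of the square roots is roughly the same in both counties
sd_sqrt <- c(small = sd(sqrt(deaths_small)), large = sd(sqrt(deaths_large)))

print(sd_raw)
print(sd_sqrt)
```

On the raw scale the large county should show a clearly larger spread, whereas after taking square roots the two spreads should be close to each other, which is what the linear model needs.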

> modellsq <- lm(sqrt(deaths)~sqrt(inhabitants),data=canc)
> summary(modellsq)

Call:
lm(formula = sqrt(deaths) ~ sqrt(inhabitants), data = canc)

Residuals:
    Min      1Q  Median      3Q     Max
  [...]

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)          [...]      [...]   [...]    [...]
sqrt(inhabitants)    [...]      [...]   [...]   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: [...] on 299 degrees of freedom
Multiple R-squared: [...], Adjusted R-squared: [...]
F-statistic: 4051 on 1 and 299 DF, p-value: < 2.2e-16

qqnorm(modellsq$residuals)

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

plot(modellsq$fitted.values,modellsq$residuals,log="x")
plot(canc$inhabitants,modellsq$residuals,log="x")

[Figures: modellsq$residuals against modellsq$fitted.values and against canc$inhabitants, log-scaled x-axis]

The qqnorm plot is not perfect, but at least the variance is stabilized.
The result remains the same: No significant relation between county size and breast cancer death risk.

6 Why it is called regression

Origin of the word "Regression"
Sir Francis Galton ([...]): Regression toward the mean.
Tall fathers tend to have sons that are slightly smaller than the fathers.

Sons of small fathers are on average larger than their fathers.

[Figures: three scatter plots of body heights, axes: father and son]
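The effect can be reproduced with simulated data. The sketch below uses invented numbers (mean height 175 cm, standard deviation 7 cm, correlation 0.5 between the heights of father and son); none of them are taken from Galton's data. Sons are simulated with the same mean and standard deviation as the fathers, so any shift toward the mean comes from the imperfect correlation alone.

```r
set.seed(1)                         # arbitrary seed, for reproducibility only
n <- 5000
mu <- 175; s <- 7; rho <- 0.5       # hypothetical mean [cm], SD [cm] and correlation

father <- rnorm(n, mean = mu, sd = s)
# son's height: same mean and SD as the fathers, correlation rho with the father
son <- mu + rho * (father - mu) + rnorm(n, mean = 0, sd = s * sqrt(1 - rho^2))

fit <- lm(son ~ father)
slope <- unname(coef(fit)["father"])   # close to rho, i.e. clearly below 1

tall  <- father > mu + s               # fathers more than one SD above the mean
small <- father < mu - s               # fathers more than one SD below the mean
diff_tall  <- mean(son[tall]  - father[tall])    # average son minus father, tall fathers
diff_small <- mean(son[small] - father[small])   # average son minus father, small fathers

print(c(slope = slope, diff_tall = diff_tall, diff_small = diff_small))
```

Because the standard deviations of fathers and sons are equal here, the regression slope is approximately the correlation, and a slope below 1 is exactly the "regression toward the mean".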

Similar effects
In sports: The champion of the season will tend to fail the high expectations in the next year.
In school: If the worst 10% of the students get extra lessons and are not the worst 10% in the next year, then this does not prove that the extra lessons are useful.

Some of what you should be able to explain
Model assumptions underlying linear regression
  Equation
  What is random, what is fixed?
Approach: minimize the sum of squared residuals
Optimal solution for slope and intercept
Slope vs. correlation
t-test for the slope (standard error, test statistic and df)
Scaling the data: when, why, how?
qqnorm plots
  theory
  how to use them to judge model assumptions
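Several of these points can be traced in a few lines of R. The sketch below uses simulated data with arbitrary, assumed parameter values (a = 1, b = 0.3, sigma = 1); it computes slope, intercept, residual variance with n - 2 degrees of freedom, the standard error of the slope and the t statistic by hand and compares them with the corresponding row of summary(lm()).

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
n <- 25
x <- runif(n, min = 0, max = 10)             # hypothetical explanatory variable
y <- 1 + 0.3 * x + rnorm(n, sd = 1)          # assumed model: a = 1, b = 0.3, sigma = 1

sxx   <- sum((x - mean(x))^2)
b_hat <- sum((y - mean(y)) * (x - mean(x))) / sxx   # slope
a_hat <- mean(y) - b_hat * mean(x)                   # intercept
s2    <- sum((y - a_hat - b_hat * x)^2) / (n - 2)    # residual variance, n - 2 df
se_b  <- sqrt(s2 / sxx)                              # standard error of the slope
t_b   <- b_hat / se_b                                # test statistic for H0: b = 0
p_b   <- 2 * pt(abs(t_b), df = n - 2, lower.tail = FALSE)   # two-sided p-value

by_hand <- c(b_hat, se_b, t_b, p_b)
# row "x" of the coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
from_lm <- unname(coef(summary(lm(y ~ x)))["x", ])
print(rbind(by_hand, from_lm))

# the slope can also be written as cov(x, y) / var(x)
print(cov(x, y) / var(x) - b_hat)
# the regression line passes through the center of gravity (mean(x), mean(y))
print(a_hat + b_hat * mean(x) - mean(y))
```

The last two printed values should be zero up to rounding error.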

