Statistics for EES 7. Linear regression and linear models


Statistics for EES
7. Linear regression and linear models
Dirk Metzler
http://www.zi.biologie.uni-muenchen.de/evol/statgen.html
26 May 2009

Contents
1. Univariate linear regression: how and why?
2. t-test for linear regression
3. log-scaling the data

Univariate linear regression: how and why?

Griffon vulture, Gyps fulvus (German: Gänsegeier). Photo (c) by Jörg Hempel.

Univariate linear regression: how and why?

Prinzinger, R., E. Karl, R. Bögel, Ch. Walzer (1999) Energy metabolism, body temperature, and cardiac work in the Griffon vulture Gyps fulvus - telemetric investigations in the laboratory and in the field. Zoology 102, Suppl. II: 15.

Data from Goethe-University, group of Prof. Prinzinger, who developed a telemetric system for measuring heart beats of flying birds.

Important for ecological questions: the metabolic rate. The metabolic rate can only be measured in the lab. Can we infer the metabolic rate from the heart beat frequency?

Univariate linear regression: how and why?

[Figure: scatterplot of metabolic rate [J/(g*h)] (0 to 30) against heart beats [per minute] (50 to 100); griffon vulture, 17.05.99, 16 degrees C.]
Univariate linear regression: how and why?

vulture
   day           heartbpm metabol mintemp maxtemp medtemp
1  01.04./02.04.    70.28   11.51      -6       2    -2.0
2  01.04./02.04.    66.13   11.07      -6       2    -2.0
3  01.04./02.04.    58.32   10.56      -6       2    -2.0
4  01.04./02.04.    58.63   10.62      -6       2    -2.0
5  01.04./02.04.    58.05    9.52      -6       2    -2.0
6  01.04./02.04.    66.37    7.19      -6       2    -2.0
7  01.04./02.04.    62.43    8.78      -6       2    -2.0
8  01.04./02.04.    65.83    8.24      -6       2    -2.0
9  01.04./02.04.    47.90    7.47      -6       2    -2.0
10 01.04./02.04.    51.29    7.83      -6       2    -2.0
11 01.04./02.04.    57.20    9.18      -6       2    -2.0
...                                      (14 different days)

Univariate linear regression: how and why?

> model <- lm(metabol~heartbpm, data=vulture, subset=day=="17.05.")
> summary(model)

Call:
lm(formula = metabol ~ heartbpm, data = vulture, subset = day == "17.05.")

Residuals:
    Min      1Q  Median      3Q     Max
-2.2026 -0.2555  0.1005  0.6393  1.1834

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.73522    0.84543  -9.149 5.60e-08 ***
heartbpm     0.27771    0.01207  23.016 2.98e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.912 on 17 degrees of freedom
Multiple R-squared: 0.9689, Adjusted R-squared: 0.9671
F-statistic: 529.7 on 1 and 17 DF,  p-value: 2.979e-14

Univariate linear regression: how and why?

[Figure: a straight line through points (x_1, y_1), (x_2, y_2), (x_3, y_3). The intercept a is the height of the line at x = 0; the slope is b = (y_2 − y_1)/(x_2 − x_1); the line is y = a + b·x.]

Univariate linear regression: how and why?

[Figure: data points with vertical distances r_1, r_2, ..., r_n to a candidate line. The residuals are r_i = y_i − (a + b·x_i); the regression line must minimize the sum of squared residuals r_1² + r_2² + ... + r_n².]

Univariate linear regression: how and why?

Define the regression line y = â + b̂·x by minimizing the sum of squared residuals:

    (â, b̂) = argmin_(a,b) Σ_i (y_i − (a + b·x_i))²

This is based on the model assumption that values a, b exist such that for all data points (x_i, y_i) we have

    y_i = a + b·x_i + ε_i,

where all ε_i are independent and normally distributed with the same variance σ².

Univariate linear regression: how and why?

Given data:              Model: there are values a, b, σ² such that

  Y    X
 y_1  x_1                y_1 = a + b·x_1 + ε_1
 y_2  x_2                y_2 = a + b·x_2 + ε_2
 y_3  x_3                y_3 = a + b·x_3 + ε_3
 ...                     ...
 y_n  x_n                y_n = a + b·x_n + ε_n

ε_1, ε_2, ..., ε_n are independent N(0, σ²).
Hence y_1, y_2, ..., y_n are independent, with y_i ~ N(a + b·x_i, σ²).
a, b, σ² are unknown, but not random.

Univariate linear regression: how and why?

We estimate a and b by computing

    (â, b̂) := argmin_(a,b) Σ_i (y_i − (a + b·x_i))².

Theorem: â and b̂ are given by

    b̂ = Σ_i (y_i − ȳ)(x_i − x̄) / Σ_i (x_i − x̄)²  =  Σ_i y_i·(x_i − x̄) / Σ_i (x_i − x̄)²

and

    â = ȳ − b̂·x̄.

Please keep in mind: the line y = â + b̂·x goes through the center of gravity of the cloud of points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
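The slides compute these estimators in R; as a cross-check, here is a small pure-Python sketch of the theorem's formulas. The data values are invented for illustration, not from the lecture.

```python
# Least-squares estimators from the theorem, on made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b-hat = sum_i (y_i - y_bar)(x_i - x_bar) / sum_i (x_i - x_bar)^2
b_hat = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
# a-hat = y_bar - b-hat * x_bar
a_hat = y_bar - b_hat * x_bar

# The fitted line passes through the center of gravity (x_bar, y_bar):
assert abs((a_hat + b_hat * x_bar) - y_bar) < 1e-12
print(round(a_hat, 4), round(b_hat, 4))
```

For these numbers the estimates work out to â = 0.1 and b̂ = 1.98, and the centroid check holds by construction of â.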

Univariate linear regression: how and why?

Sketch of the proof of the theorem: let g(a, b) = Σ_i (y_i − (a + b·x_i))². We optimize g by setting the partial derivatives of g to 0:

    ∂g(a, b)/∂a = Σ_i 2·(y_i − (a + b·x_i))·(−1)
    ∂g(a, b)/∂b = Σ_i 2·(y_i − (a + b·x_i))·(−x_i)

so at the minimum (â, b̂) we obtain

    0 = Σ_i (y_i − (â + b̂·x_i))·(−1)
    0 = Σ_i (y_i − (â + b̂·x_i))·(−x_i)

Univariate linear regression: how and why?

    0 = Σ_i (y_i − (â + b̂·x_i))
    0 = Σ_i (y_i − (â + b̂·x_i))·x_i

gives us

    0 = (Σ_i y_i) − n·â − b̂·(Σ_i x_i)
    0 = (Σ_i y_i·x_i) − â·(Σ_i x_i) − b̂·(Σ_i x_i²)

and the theorem follows by solving this for â and b̂.
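The two normal equations form a 2x2 linear system in â and b̂; solving it directly must reproduce the closed-form estimators. A Python sketch with invented data:

```python
# Solve the normal equations by Cramer's rule and compare with the
# closed-form least-squares estimators. Data are made up.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]
n = len(xs)

sx = sum(xs); sy = sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations:  sy  = n*a  + b*sx
#                    sxy = a*sx + b*sxx
det = n * sxx - sx * sx
a_hat = (sy * sxx - sx * sxy) / det
b_hat = (n * sxy - sx * sy) / det

# Closed-form solution from the theorem:
x_bar, y_bar = sx / n, sy / n
b_direct = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)
assert abs(b_hat - b_direct) < 1e-12
assert abs(a_hat - (y_bar - b_hat * x_bar)) < 1e-12
```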

Univariate linear regression: how and why?

Optimizing clutch size

Example: Cowpea weevil (also bruchid beetle), Callosobruchus maculatus (German: Erbsensamenkäfer).

Wilson, K. (1994) Evolution of clutch size in insects. II. A test of static optimality models using the beetle Callosobruchus maculatus (Coleoptera: Bruchidae). Journal of Evolutionary Biology 7: 365-386.

How does survival probability depend on clutch size? Which clutch size optimizes the expected number of surviving offspring?

Univariate linear regression: how and why?

[Figure: viability (0.0 to 1.0) plotted against clutchsize (5 to 20).]

[Figure: clutchsize * viability (2 to 8), i.e. the expected number of surviving offspring, plotted against clutchsize (5 to 20).]
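The quantity plotted in the second figure, clutch size times survival probability, is what the optimality question maximizes. A toy Python sketch with invented viability values (not from Wilson 1994):

```python
# Expected number of surviving offspring = clutch size * survival probability.
# The viability values below are invented for illustration.
viability = {5: 0.90, 10: 0.70, 15: 0.45, 20: 0.25}

expected_survivors = {c: c * v for c, v in viability.items()}
best = max(expected_survivors, key=expected_survivors.get)
print(best)  # the clutch size maximizing the expectation
```

With these made-up numbers a clutch size of 10 wins: fewer eggs survive per capita than at 5, but the larger clutch more than compensates.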

Contents
1. Univariate linear regression: how and why?
2. t-test for linear regression
3. log-scaling the data

t-test for linear regression

Example: red deer (Cervus elaphus). Theory: females can influence the sex of their offspring.

Evolutionarily stable strategy: weak animals may tend to have female offspring, strong animals may tend to have male offspring.

Clutton-Brock, T. H., Albon, S. D., Guinness, F. E. (1986) Great expectations: dominance, breeding success and offspring sex ratios in red deer. Anim. Behav. 34, 460-471.

t-test for linear regression

> hind
    rank ratiomales
1   0.01       0.41
2   0.02       0.15
3   0.06       0.12
4   0.08       0.04
5   0.08       0.33
6   0.09       0.37
...
52  0.96       0.81
53  0.99       0.47
54  1.00       0.67

CAUTION: simulated data, inspired by the original paper.

t-test for linear regression

[Figure: scatterplot of hind$ratiomales (0.0 to 1.0) against hind$rank (0.0 to 1.0).]

t-test for linear regression

> mod <- lm(ratiomales~rank, data=hind)
> summary(mod)

Call:
lm(formula = ratiomales ~ rank, data = hind)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32798 -0.09396  0.02408  0.11275  0.37403

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.20529    0.04011   5.119 4.54e-06 ***
rank         0.45877    0.06732   6.814 9.78e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.154 on 52 degrees of freedom
Multiple R-squared: 0.4717, Adjusted R-squared: 0.4616

t-test for linear regression

Model: Y = a + b·X + ε with ε ~ N(0, σ²)

How do we compute the significance of a relationship between the explanatory trait X and the target variable Y? In other words: how can we test the null hypothesis b = 0?

We have estimated b by b̂ ≠ 0. Could the true b still be 0? How large is the standard error of b̂?

t-test for linear regression

    y_i = a + b·x_i + ε_i  with  ε_i ~ N(0, σ²)

Not random: a, b, x_i, σ². Random: ε_i, y_i.

var(y_i) = var(a + b·x_i + ε_i) = var(ε_i) = σ², and y_1, y_2, ..., y_n are stochastically independent.

    b̂ = Σ_i y_i·(x_i − x̄) / Σ_i (x_i − x̄)²

    var(b̂) = var( Σ_i y_i·(x_i − x̄) / Σ_i (x_i − x̄)² )
           = var( Σ_i y_i·(x_i − x̄) ) / ( Σ_i (x_i − x̄)² )²
           = Σ_i var(y_i)·(x_i − x̄)² / ( Σ_i (x_i − x̄)² )²
           = σ²·Σ_i (x_i − x̄)² / ( Σ_i (x_i − x̄)² )²
           = σ² / Σ_i (x_i − x̄)²
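The variance formula can be checked empirically: simulate the model many times with known a, b, σ and compare the empirical variance of b̂ with σ²/Σ_i (x_i − x̄)². A Monte Carlo sketch in Python, with parameters of my own choosing:

```python
# Monte Carlo check of var(b-hat) = sigma^2 / sum_i (x_i - x_bar)^2.
# The design points and parameters below are invented.
import random
random.seed(1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
a, b, sigma = 1.0, 0.5, 2.0
x_bar = sum(xs) / len(xs)
sxx = sum((x - x_bar) ** 2 for x in xs)

def b_hat(ys):
    # estimator from the slides: sum_i y_i (x_i - x_bar) / sum_i (x_i - x_bar)^2
    return sum(y * (x - x_bar) for x, y in zip(xs, ys)) / sxx

estimates = []
for _ in range(20000):
    ys = [a + b * x + random.gauss(0.0, sigma) for x in xs]
    estimates.append(b_hat(ys))

mean_b = sum(estimates) / len(estimates)
emp_var = sum((e - mean_b) ** 2 for e in estimates) / len(estimates)
theo_var = sigma ** 2 / sxx
# Empirical and theoretical variance should agree to within a few percent.
print(round(emp_var, 3), round(theo_var, 3))
```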

t-test for linear regression

In fact, b̂ is normally distributed with mean b and

    var(b̂) = σ² / Σ_i (x_i − x̄)².

Problem: we do not know σ². We estimate σ² by the residual variance

    s² := Σ_i (y_i − â − b̂·x_i)² / (n − 2).

Note that we divide by n − 2. The reason is that two model parameters, a and b, have been estimated, so two degrees of freedom are lost.

t-test for linear regression

    var(b̂) = σ² / Σ_i (x_i − x̄)²

Estimate σ² by

    s² = Σ_i (y_i − â − b̂·x_i)² / (n − 2).

Then

    (b̂ − b) / ( s / √(Σ_i (x_i − x̄)²) )

is Student-t-distributed with n − 2 degrees of freedom, and we can apply the t-test to test the null hypothesis b = 0.
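The whole pipeline, from estimators to the t-statistic for H0: b = 0, fits in a few lines. A pure-Python sketch; the data are invented, so the resulting t-value is only an illustration:

```python
# t-statistic for H0: b = 0 in simple linear regression, on made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b_hat = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys)) / sxx
a_hat = y_bar - b_hat * x_bar

# residual variance with n - 2 degrees of freedom
s2 = sum((y - a_hat - b_hat * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
se_b = (s2 / sxx) ** 0.5   # standard error of b-hat
t = b_hat / se_b           # compare against t-quantiles with n - 2 df
print(round(b_hat, 3), round(t, 2))
```

For these six points t is above 21 with 4 degrees of freedom, so the slope would be highly significant, which is exactly the kind of number the `summary(lm(...))` output above reports in its t value column.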

Contents
1. Univariate linear regression: how and why?
2. t-test for linear regression
3. log-scaling the data

log-scaling the data

Data example: typical body weight [kg] and brain weight [g] of 62 mammal species (and 3 dinosaurs).

> data
   weight.kg. brain.weight.g          species extinct
1     6654.00        5712.00 african elephant      no
2        1.00           6.60                       no
3        3.39          44.50                       no
4        0.92           5.70                       no
5     2547.00        4603.00   asian elephant      no
6       10.55         179.50                       no
7        0.02           0.30                       no
8      160.00         169.00                       no
9        3.30          25.60              cat      no
10      52.16         440.00       chimpanzee      no
11       0.43           6.40                       no
...

log-scaling the data

[Figure: "typical values for 62 mammal species": brain weight [g] (0 to 5000) against body weight [kg] (0 to 6000), on linear axes.]

[Figure: the same data on log-log axes (body weight 1e-02 to 1e+04 kg, brain weight 1e-01 to 1e+03 g), with species labels such as mouse, human, chimpanzee, the elephants, and the dinosaurs Triceratops, Diplodocus, and Brachiosaurus.]

log-scaling the data

> modell <- lm(brain.weight.g~weight.kg., subset=extinct=="no")
> summary(modell)

Call:
lm(formula = brain.weight.g ~ weight.kg., subset = extinct == "no")

Residuals:
    Min      1Q  Median      3Q     Max
-809.95  -87.43  -78.55  -31.17 2051.05

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 89.91213   43.58134   2.063   0.0434 *
weight.kg.   0.96664    0.04769  20.269   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 334.8 on 60 degrees of freedom
Multiple R-squared: 0.8726, Adjusted R-squared: 0.8704
F-statistic: 410.8 on 1 and 60 DF, p-value: < 2.2e-16

log-scaling the data

qqnorm(modell$residuals)

[Figure: normal Q-Q plot of the residuals; the sample quantiles range from about -500 to 2000.]

log-scaling the data

plot(modell$fitted.values, modell$residuals)

[Figure: residuals (-500 to 2000) against fitted values.]

log-scaling the data

plot(modell$fitted.values, modell$residuals, log="x")

[Figure: residuals against fitted values (100 to 5000), x-axis log-scaled.]

log-scaling the data

plot(modell$model$weight.kg., modell$residuals)

[Figure: residuals against body weight (0 to 6000 kg).]

log-scaling the data

plot(modell$model$weight.kg., modell$residuals, log="x")

[Figure: residuals against body weight, x-axis log-scaled (1e-02 to 1e+04 kg).]

log-scaling the data

We see that the residuals' variance depends on the fitted values (or on the body weight): heteroscedasticity.

The model assumes homoscedasticity, i.e. the random deviations must be (almost) independent of the explanatory traits (body weight) and of the fitted values.

Variance-stabilizing transformation: we can rescale body and brain size to make the deviations independent of the variables.

log-scaling the data

Actually this is not so surprising: an elephant's brain of typically 5 kg can easily be 500 g lighter or heavier from individual to individual. This cannot happen for a mouse brain of typically 5 g; the latter will rather also vary by 10%, i.e. by 0.5 g. Thus, the variation is not additive but rather multiplicative:

    brain mass = (expected brain mass) · (random factor)

We can convert this into something with additive randomness by taking the log:

    log(brain mass) = log(expected brain mass) + log(random factor)
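This argument is easy to verify numerically: if the noise is a multiplicative factor, the log-scale residual equals log(factor) regardless of the expected value. A Python sketch with invented numbers:

```python
import math

# Multiplicative noise becomes additive on the log scale.
# The masses and variation factors below are made up for illustration.
factors = [0.9, 1.0, 1.1]   # +/- 10% individual variation

elephant = 5000.0           # an "elephant-sized" expected brain mass in g
mouse = 5.0                 # a "mouse-sized" expected brain mass in g

resid_elephant = [math.log(elephant * f) - math.log(elephant) for f in factors]
resid_mouse = [math.log(mouse * f) - math.log(mouse) for f in factors]

# On the log scale the residual is log(factor), independent of the species:
for r1, r2, f in zip(resid_elephant, resid_mouse, factors):
    assert abs(r1 - math.log(f)) < 1e-9
    assert abs(r1 - r2) < 1e-9
```

The same 10% deviation produces identical log-scale residuals for the elephant and the mouse, which is exactly the homoscedasticity the log-log regression below relies on.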

log-scaling the data

> logmodell <- lm(log(brain.weight.g)~log(weight.kg.), subset=extinct=="no")
> summary(logmodell)

Call:
lm(formula = log(brain.weight.g) ~ log(weight.kg.), subset = extinct == "no")

Residuals:
     Min       1Q   Median       3Q      Max
-1.68908 -0.51262 -0.05016  0.46023  1.97997

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      2.11067    0.09794   21.55   <2e-16 ***
log(weight.kg.)  0.74985    0.02888   25.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7052 on 60 degrees of freedom
Multiple R-squared: 0.9183, Adjusted R-squared: 0.9169

log-scaling the data

qqnorm(logmodell$residuals)

[Figure: normal Q-Q plot of the log-model residuals; the sample quantiles range from about -1 to 2.]

log-scaling the data

plot(logmodell$fitted.values, logmodell$residuals)

[Figure: residuals (-1 to 2) against fitted values (0 to 8).]

log-scaling the data

plot(logmodell$fitted.values, logmodell$residuals, log="x")

[Figure: residuals against fitted values, x-axis log-scaled.]

log-scaling the data

plot(weight.kg.[extinct=="no"], logmodell$residuals)

[Figure: residuals (-1 to 2) against body weight (0 to 6000 kg).]

log-scaling the data

plot(weight.kg.[extinct=="no"], logmodell$residuals, log="x")

[Figure: residuals against body weight, x-axis log-scaled (1e-02 to 1e+04 kg).]