Regression. Correlation vs. regression. The parameters of linear regression. Regression assumes... Random sample. Y = α + β X.

Regressio Correlatio vs. regressio Predicts Y from X Liear regressio assumes that the relatioship betwee X ad Y ca be described by a lie Regressio assumes... Radom sample Y is ormally distributed with equal variace for all values of X The parameters of liear regressio Y = α + β X

Positive β β = 0 Estimatig a regressio lie Higher α Y = a + b X Negative β Lower α Nomeclature Fidig the "least squares" regressio lie Residual: Y i ˆ Y i Miimize: SS residual = i =1 ( Y i Y ˆ i )

Best estimate of the slope b = i =1 ( X i X ) Y i Y i =1 ( ) ( X i X ) (= "Sum of cross products" over "Sum of squares of X") Remember the shortcuts: X $ ' i Y i ( X i X )( Y i Y ) = & X i Y i ) i =1 % ( ( X i X ) = X i i =1 ( ) $ ' & X i ) % ( Fidig a Y = a + bx So.. a = Y bx Example: Predictig age based o radioactivity i teeth May above groud uclear bomb tests i the 50s ad 60s may have left a radioactive sigal i developig teeth. Is it possible to predict a perso s age based o detal C 14? Data from 1965 to preset from Spaldig et al. 005. Foresics: age writte i teeth by uclear tests. Nature 437: 333 334.

Teeth data: Teeth data: Δ 14 C Date of Birth 89 1985.5 109 1983.5 91 1990.5 17 1987.5 99 1990.5 110 1984.5 13 1983.5 105 1989.5 Δ 14 C Date of Birth 6 1963.5 6 1971.7 471 1963.7 11 1990.5 85 1975 439 1970. 363 197.6 391 1971.8 Let X be the estimated age, ad Y be the actual age. X = 3798, Y = 31674 X =1340776, ( XY ) = 74953 Y = 670404 =16 X = 37.375 Y =1979.63 # & ( X i X) ( Y i Y ) = % X i Y i ( $ ' i=1 i=1 = 74953 3798 # & % X i ( ( X i X) $ ' = ( X i ) X i Y i ( )( 31674) 16 ( =1340776 3798 ) = 4396 16 = 3393 Calculatig a a = Y bx = 1979.63 ( 0.053)37.375 = 199. b = 3393 4396 = 0.053

ˆ Y =199. 0.053X Predictig Y from X If a cadaver has a tooth with Δ 14 C cotet equal to 00, what does the regressio lie predict its year of birth to be?! ˆ Y =199. 0.053X ( ) =199. 0.053 00 =1981.6 r predicts the amout of variace i Y explaied by the regressio lie r is the coefficiet of determiatio: it is the square of the correlatio coefficiet r

Cautio: It is uwise to extrapolate beyod the rage of the data. More data o fish i desert pools Number of species of fish as predicted by the area of a desert pool If we were to extrapolate to ask how may species might be i a pool of 50000m, we would guess about 0. Log trasformed data: Testig hypotheses about regressio H 0 : β = 0 H A : β 0

Sums of squares for regressio SS Total = Y i SS regressio = b $ ' & Y i ) % ( ( X i X ) Y i Y SS residual + SS regressio = SS Total ( ) With - degrees of freedom for the residual Radioactive teeth: Sums of squares SS Total = Y i # & % Y i ( $ ' ( = 670404 31674 ) 16 SS regressio = b ( X i X) Y i Y ( ) =1339.75 = ( 0.053) ( 3393) =139.8 Teeth: Sums of squares Calculatig residual mea squares SS residual = SS Total SS regressio = 1339.75 139.8 = 99.9 df residual =16 =14 MS residual = SS residual / df residual MS residual = 99.9 14 = 7.1

Stadard error of a slope b has a t distributio SE b = = MS residual ( X i X) 7.1 4396 = 0.004 Cofidece iterval for a slope: b ± t α[],df SE b Hypothesis tests ca use t: t = b β 0 SE b Example: 95% cofidece iterval for slope with teeth example Cofidece bads: cofidece itervals for predictios of mea Y b ± t α[],df SE b = b ± t 0.05[],14 SE b = 0.053±.14( 0.004) = 0.053± 0.0018

Predictio itervals: cofidece itervals for predictios of idividual Y Hypothesis tests o slopes H 0 : β = 0 H A : β 0 t = b β 0 SE b t = 0.053 0 0.004 =13.5 t 0.0001(),14 = ±5.36 So we ca reject H 0, P<0.0001 No-liear relatioships Trasformatios Trasformatios Quadratic regressio If Y = ax b the ly = la + bl X. If Y = ab X the ly = la + X lb. Splies If Y = a + b X the set X " = 1, ad calculatey = a + b " X X. All of the equatios o the right have the form Y=a+bX.

No-liear relatioship: Number of fish species vs. Size of desert pool Residual plots help assess assumptios Origial: Residual plot Logs: Trasformed data Residual plot Polyomial regressio Number of species = 0.046 + 0.185 Biomass - 0.00044 Biomass

Do ot fit a polyomial with too may terms (the sample size should be at least 7 times the umber of terms) Comparig two slopes Example: Comparig species-area curves for islads to those of mailad populatios Log10(Number of species) By Log 10(Area of "islad").5 Log10(Number of species).0 1.5 1.0 0.5-1 0 1 3 4 5 6 7 Log 10(Area of "islad") Liear Fit Type of islad=i Liear Fit Type of islad=m Liear Fit Type of islad=i Hypotheses H 0 : β M = β I. H A : β M β I. The error i the differece of two slopes is ormally distributed. ( ) ( β 1 β ) t = b 1 b SE b1 b df = 1 - + -

( ( MS error ) p = SS error) 1 + ( SS error ) ( DF error ) 1 + ( DF error ) Aalysis of covariace (ANCOVA) Compares may slopes H 0 : β 1 = β = β 3 = β 4 = β 5 SE b1 b = ( MS error ) p 1 ( MS error ) p + $ ' $ ' ( X X ) & ) ( X X ) & ) % ( % ( H A : At least oe of the slopes is differet from aother. Logistic regressio Tests for relatioship betwee a umerical variable (as the explaatory variable) ad a biary variable (as the respose). e.g.: Does the dose of a toxi affect probability of survival? Does the legth of a peacock's tail affect its probability of gettig a mate?