Inferences for linear regression (sections 12.1, 12.2)

Regression case history: do bigger national parks help prevent extinction?

ex. area of natural reserves and extinction: 6 national parks in Tanzania (Newmark, W. D. 1996. Insularization of Tanzanian parks and local extinction of large mammals. Conservation Biology 10:1549-1556)

  area     yrs        initial   present
  (km^2)   protected  species   species
  a        t          S_0       S_t
  -----------------------------------
    100      36         35        33
    137      35         26        25
   1834      83         23        21
   2600      38         41        40
  12950      44         39        39
  23051      55         49        49

species extinction modeled as an exponential decay equation:

  S_t = S_0 e^(-kt)

  k   : extinction rate
  t   : time in years
  e   = 2.71828...
  S_0 : number of species at time 0
  S_t : number of species remaining after t years
rearranged (solve for k):

  k = (1/t) log_e(S_0 / S_t)

each park has a different value of k; calculate using the amount of time the park has been protected. Also calculate the logarithm of the area of the park (Newmark used base 10 for this calculation):

      k        log a
  0.0016345   2.00000
  0.0011206   2.13672
  0.0010960   3.26340
  0.0006498   3.41497
  0.0000000   4.11227
  0.0000000   4.36269

a scatterplot of k vs log a reveals a strong relationship between the size of the preserve and the extinction rate
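As a check, both columns of the table above can be recomputed directly from the park data. A short sketch with NumPy (the variable names are mine, not from the notes):

```python
import numpy as np

# park data from the Newmark table
area = np.array([100, 137, 1834, 2600, 12950, 23051])  # km^2
t    = np.array([36, 35, 83, 38, 44, 55])              # years protected
S0   = np.array([35, 26, 23, 41, 39, 49])              # initial species count
St   = np.array([33, 25, 21, 40, 39, 49])              # present species count

# extinction rate from S_t = S_0 e^(-kt)  =>  k = (1/t) ln(S0/St)
k = np.log(S0 / St) / t
log_area = np.log10(area)          # Newmark used base-10 logs of area

for ki, la in zip(k, log_area):
    print(f"{ki:.7f}  {la:.5f}")
```

The two parks with no recorded extinctions (S_0 = S_t) give ln(1) = 0, hence k = 0, exactly as in the table.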
[scatterplot of k vs log_area: k ranges from 0.0000 to 0.0018, log_area from 2.0 to 4.5]
output of MINITAB regression analysis:

  Regression Analysis: k versus log_area

  The regression equation is
  k = 0.00277 - 0.000627 log_area

  Predictor    Coef         SE Coef      T       P
  Constant      0.0027656   0.0004077    6.78    0.002
  log_area     -0.0006269   0.0001222   -5.13    0.007

  S = 0.000267731   R-Sq = 86.8%   R-Sq(adj) = 83.5%

  Analysis of Variance
  Source           DF   SS           MS           F       P
  Regression        1   1.88767E-06  1.88767E-06  26.33   0.007
  Residual Error    4   2.86720E-07  7.16800E-08
  Total             5   2.17439E-06

  Predicted Values for New Observations
  New Obs   Fit        SE Fit     95% CI                  95% PI
  1         0.000885   0.000112   (0.000573, 0.001197)    (0.000079, 0.001691)

  Values of Predictors for New Observations
  New Obs   log_area
  1         3.00
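The coefficients, R-Sq, and the fitted value at log_area = 3.00 from the MINITAB output can be reproduced in a few lines. A sketch using `scipy.stats.linregress` (not part of the original analysis):

```python
import numpy as np
from scipy.stats import linregress

# k and log_area from the table above
log_area = np.array([2.00000, 2.13672, 3.26340, 3.41497, 4.11227, 4.36269])
k = np.array([0.0016345, 0.0011206, 0.0010960, 0.0006498, 0.0, 0.0])

res = linregress(log_area, k)
print(f"slope     = {res.slope:.7f}")        # -0.0006269
print(f"intercept = {res.intercept:.7f}")    #  0.0027656
print(f"R-Sq      = {res.rvalue**2:.1%}")    #  86.8%

# fitted value at log_area = 3.00, as in "Predicted Values for New Observations"
fit = res.intercept + res.slope * 3.00
print(f"fit at log_area = 3.00: {fit:.6f}")  # 0.000885
```

The confidence and prediction intervals in the output require the standard errors as well; `linregress` returns `stderr` and `intercept_stderr` for that purpose.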
[residual plots for k: normal probability plot of the residuals; residuals versus the fitted values; histogram of the residuals; residuals versus the order of the data. Fitted line plot: k = 0.002766 - 0.000627 log_area, showing the regression line with 95% CI and 95% PI bands; S = 0.0002677, R-Sq = 86.8%, R-Sq(adj) = 83.5%]
sample:
  y : response variable (dependent var.)
  x : predictor variable
  ŷ = a + bx : least squares regression line

population: at each value of x, there is a whole frequency distribution of y values

  μ_y = α + βx

μ_y : population mean of y; depends (linearly) on the value of x
α and β are the true values of the line constants in the population; a and b are their estimates from the sample.
σ : standard deviation around μ_y (assume it does not depend on x)
Tasks:
1. estimate α, β, σ; estimate μ_y = α + βx (the height of the line at x)
2. measure strength of linear association
3. confidence intervals for α, β
4. hypothesis tests for α, β (especially H_0: β = 0 vs H_a: β ≠ 0)
5. confidence interval for μ_y at x
6. prediction interval for a future response, y, at x

1. estimates of model parameters

data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

  r = [1/(n-1)] Σ_{i=1}^{n} [(x_i - x̄)/s_x] [(y_i - ȳ)/s_y]

  β̂ = b = r (s_y / s_x)
  α̂ = a = ȳ - b x̄

r   : sample correlation
s_y : st. dev. of the y's
s_x : st. dev. of the x's

  μ̂_y = ŷ = a + bx   (estimated height of the line at any particular value of x, used to predict y at x)
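Applied to the park data, these formulas give exactly the line MINITAB reported. A sketch with NumPy (`ddof=1` gives the sample standard deviations, matching the n-1 divisor above):

```python
import numpy as np

x = np.array([2.00000, 2.13672, 3.26340, 3.41497, 4.11227, 4.36269])  # log_area
y = np.array([0.0016345, 0.0011206, 0.0010960, 0.0006498, 0.0, 0.0])  # k
n = len(x)

sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)   # sample st. devs s_x, s_y
# r = average product of standardized deviations (n-1 divisor)
r = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

b = r * sy / sx                # slope estimate: b = r (s_y / s_x)
a = y.mean() - b * x.mean()    # intercept estimate: a = ybar - b xbar
print(r, b, a)
```

The sample correlation comes out strongly negative (about -0.93), and b and a agree with the Coef column of the MINITAB output.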
  ŷ_i = a + b x_i   (estimated height of the line at observation x_i)
  e_i = y_i - ŷ_i = y_i - a - b x_i   (residual, or error of prediction, at observation x_i)

  SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i - a - b x_i)²   (sum of squared errors)

  σ̂² = SSE/(n - 2)        σ̂ = √[SSE/(n - 2)]

2. measure strength of linear association

r : the sample correlation coefficient

  r = [1/(n-1)] Σ_{i=1}^{n} [(x_i - x̄)/s_x] [(y_i - ȳ)/s_y] = average product of the standardized x and y values

properties of r:
1. -1 ≤ r ≤ 1
2. the closer the sample data points are to lying on a straight line, the closer r is to 1 or -1
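For the park regression, SSE and σ̂ match the Residual Error SS and the S value in the MINITAB output. A sketch (here the line is fit with `np.polyfit`, which returns the least squares slope and intercept):

```python
import numpy as np

x = np.array([2.00000, 2.13672, 3.26340, 3.41497, 4.11227, 4.36269])  # log_area
y = np.array([0.0016345, 0.0011206, 0.0010960, 0.0006498, 0.0, 0.0])  # k
n = len(x)

b, a = np.polyfit(x, y, 1)          # least squares slope b and intercept a

e = y - (a + b * x)                 # residuals e_i = y_i - yhat_i
SSE = np.sum(e**2)                  # sum of squared errors
sigma_hat = np.sqrt(SSE / (n - 2))  # sigma-hat, the estimate of sigma
print(SSE, sigma_hat)
```

SSE comes out near 2.867E-07 and σ̂ near 0.000268, in agreement with the ANOVA table (S = 0.000267731).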
3. r = 1 or -1: perfect linear relationship
4. scatterplot with no linear relationship: data cloud is like a shotgun blast; r near zero
5. r near zero does not necessarily indicate absence of relationship (the relationship, for example, might be strong but nonlinear)

r-squared (r²): the coefficient of determination is the square of the correlation coefficient r

r² is the proportion of variability in y explained or accounted for by the regression model ŷ = a + bx

0 ≤ r² ≤ 1:
  r² = 0 : the regression equation is no better than ȳ for prediction
  r² = 1 : the regression equation predicts perfectly

idea of r²: prediction when there are no explanatory variables: with one quantitative response variable y and no explanatory variables, we would use ȳ to predict a new value of y

ȳ is the least squares prediction; it is the value of c that minimizes the sum of squared errors: Σ e_i² = Σ (y_i - c)²
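The claim that ȳ minimizes Σ (y_i - c)² is easy to check numerically; using the k values from the park data as the y's, nudging c away from ȳ in either direction only increases the sum of squared errors (a small sketch):

```python
import numpy as np

y = np.array([0.0016345, 0.0011206, 0.0010960, 0.0006498, 0.0, 0.0])  # the k values

def sse(c):
    """Sum of squared errors when the constant c is used to predict every y_i."""
    return np.sum((y - c)**2)

ybar = y.mean()
print(sse(ybar), sse(ybar + 1e-4), sse(ybar - 1e-4))
```

(The calculus version of the same fact: d/dc Σ (y_i - c)² = -2 Σ (y_i - c) = 0 exactly when c = ȳ.)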
think of the total unexplained variability in y as the sum of squared errors caused by just using ȳ as a predictor of y:

  total variability = Σ (y_i - ȳ)²

then compare the above SSE to the sum of squared errors in the regression model, Σ (y_i - a - b x_i)² (the variability unexplained by the regression model):

  r² = (total variability - variability unexplained by regression model) / total variability
     = (variability explained by regression model) / total variability
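For the park data this decomposition reproduces the R-Sq figure in the MINITAB output. A sketch:

```python
import numpy as np

x = np.array([2.00000, 2.13672, 3.26340, 3.41497, 4.11227, 4.36269])  # log_area
y = np.array([0.0016345, 0.0011206, 0.0010960, 0.0006498, 0.0, 0.0])  # k

b, a = np.polyfit(x, y, 1)            # least squares slope and intercept

sst = np.sum((y - y.mean())**2)       # total variability
sse = np.sum((y - (a + b * x))**2)    # unexplained by the regression model
r_sq = (sst - sse) / sst              # = explained / total
print(f"r^2 = {r_sq:.3f}")            # matches R-Sq = 86.8%
```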