Simple Linear Regression for the Climate Data

[Figure: Temperature versus CO2, showing the fitted prediction line and the prediction interval.]

What do we do with the data?
$y_i$ = temperature in the $i$th year
$x_i$ = CO2 in the $i$th year
$i = 1, \dots, n$, where $n = 55$ is the sample size
Primary Research Questions:
1. How does CO2 relate to temperature?
2. Can we predict temperatures based on a CO2 scenario?

Exploratory Results
$r = 0.93$, $\mathrm{Cov}(X, Y) = 5.55$
[Figure: scatterplot of Temperature versus CO2.]
1. Form: linear?
2. Direction: positive or negative?
3. Strength
4. Outliers

SLR Model Fit and Assumptions
$\hat{y} = -3.08 + 0.01\,(\mathrm{CO2})$, $\hat{\sigma} = 0.092$, $R^2 = 0.8649$
Predictive bias $\approx 0$, RPMSE = 0.1
1. Linear?
2. Independent?
3. Normal?
4. Equal Variance?
[Figures: scatterplot of Temperature versus CO2 with the fitted line; histogram of standardized residuals; residuals versus fitted values.]
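
The fit summaries on this slide (intercept, slope, $\hat{\sigma}$, $R^2$) all come from the usual least-squares formulas. The following is a minimal sketch in plain Python, assuming a residual standard error with $n - 2$ in the denominator. The function name `fit_slr` and the four data points are made up for illustration; they are not the climate data.

```python
import math

def fit_slr(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns b0, b1, sigma_hat, R^2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sigma_hat = math.sqrt(sse / (n - 2))  # residual standard error
    r_squared = 1 - sse / sst
    return b0, b1, sigma_hat, r_squared

# Hypothetical toy data (NOT the climate data set).
x_toy = [1, 2, 3, 4]
y_toy = [2, 4, 5, 8]
b0, b1, sigma_hat, r_squared = fit_slr(x_toy, y_toy)
print(b0, b1, sigma_hat, r_squared)
```

With real data, `x` would hold the yearly CO2 values and `y` the yearly temperatures.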

Accounting for Uncertainty in Predictions
Q: If CO2 were to increase by 1, how much would global temperatures increase on average?
A: Our best guess is 0.01, but we are uncertain.
Note: If we took another sample, our estimate $\hat{\beta}_1$ would change.
How do we incorporate sampling variability (uncertainty) into our regression results?

Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
1. Hypothesis Testing
One can show (in Stat 535, not in 330):
$t = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1 - \beta_1}{\hat{\sigma} / \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \sim T_{n-2}$
Here $T_{n-2}$ is the t-distribution with $n - 2$ degrees of freedom, so the t-distribution can be used to compute p-values.

Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
1. Hypothesis Testing
For example, the p-value for the test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$, with $t = 18.37$, is essentially 0.
So our conclusion, which accounts for uncertainty, is: the effect of CO2 on global temperature is not zero. Or, CO2 has a significant effect on temperature.
[Figure: density of the t-distribution with the observed t statistic marked.]

Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
1. Hypothesis Testing
For example, the p-value for the test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$, with $t = 18.37$, is essentially 0.
How do you interpret the p-value? Assuming the null hypothesis is true, the probability of observing a slope of 0.01, or more extreme, is essentially zero.
Note: Statements about statistical significance are really just asking you to incorporate uncertainty in your results.
[Figure: density of the t-distribution with the observed t statistic marked.]
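
To see how small "essentially 0" is, the two-sided p-value for $t = 18.37$ with $n - 2 = 53$ degrees of freedom can be approximated by integrating the t density over the upper tail. This is a rough sketch in plain Python, not the course's R or SAS code. The helper names (`t_pdf`, `upper_tail`, `two_sided_p`), the integration width, and the step count are choices made here, and the truncated tail is only negligible for moderate degrees of freedom.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def upper_tail(t, df, width=200.0, steps=20000):
    """Approximate P(T > t) with Simpson's rule on [t, t + width].

    The mass beyond t + width is ignored (an assumption that is
    reasonable for moderate df, such as df = 53 here)."""
    h = width / steps
    total = t_pdf(t, df) + t_pdf(t + width, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(t + k * h, df)
    return total * h / 3

def two_sided_p(t_stat, df):
    """Two-sided p-value for H0: beta_1 = 0 versus Ha: beta_1 != 0."""
    return 2 * upper_tail(abs(t_stat), df)

n = 55
p_value = two_sided_p(18.37, n - 2)
print(f"two-sided p-value: {p_value:.3g}")
```

Integrating the tail directly, rather than computing one minus the CDF, avoids losing the tiny probability to floating-point cancellation.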

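Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
2. Confidence Intervals
Because we know $\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim T_{n-2}$,
$\mathrm{Prob}\left( t^{*}_{0.025} < \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} < t^{*}_{0.975} \right) = 0.95$.
$t^{*}_{0.025}$ is the 2.5% quantile: the value such that 2.5% of the distribution is BELOW it.
$t^{*}_{0.975}$ is the 97.5% quantile: the value such that 97.5% of the distribution is BELOW it.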

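Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
2. Confidence Intervals
Because $\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim T_{n-2}$, we know (more generally) for any $0 < \alpha < 1$:
$\mathrm{Prob}\left( t^{*}_{\alpha/2} < \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} < t^{*}_{1-\alpha/2} \right) = 1 - \alpha$.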

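Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
2. Confidence Intervals
Rearranging, we find
$\mathrm{Prob}\left[ \hat{\beta}_1 - t^{*}_{1-\alpha/2} SE(\hat{\beta}_1) < \beta_1 < \hat{\beta}_1 - t^{*}_{\alpha/2} SE(\hat{\beta}_1) \right]$
$= \mathrm{Prob}\left[ \hat{\beta}_1 - t^{*}_{1-\alpha/2} SE(\hat{\beta}_1) < \beta_1 < \hat{\beta}_1 + t^{*}_{1-\alpha/2} SE(\hat{\beta}_1) \right] = 1 - \alpha$,
where the second line uses the symmetry of the t-distribution, $t^{*}_{\alpha/2} = -t^{*}_{1-\alpha/2}$. With $\alpha = 0.05$ the probability is 0.95.
This gives rise to the $100(1-\alpha)\%$ confidence interval formula:
$\hat{\beta}_1 \pm t^{*}_{1-\alpha/2} SE(\hat{\beta}_1)$
For the climate data, the 95% interval is (0.0085, 0.0106).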
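
The interval formula needs the quantile $t^{*}_{1-\alpha/2}$, which can be approximated numerically. The sketch below uses plain Python with Simpson's rule and bisection; the helper names (`t_pdf`, `t_cdf`, `t_quantile`, `slope_ci`) are introduced here for illustration. The slides report the slope only as 0.01, so the inputs are back-calculated assumptions: the estimate is taken as the midpoint of the reported interval (0.00955), and the standard error as that estimate divided by the reported t statistic of 18.37.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=2000):
    """P(T <= t) via Simpson's rule on [0, |t|], using symmetry about 0."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area

def t_quantile(p, df, lo=-50.0, hi=50.0):
    """Quantile by bisection; assumes the answer lies in [lo, hi]."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def slope_ci(beta1_hat, se, n, alpha=0.05):
    """100(1 - alpha)% interval: beta1_hat +/- t*_{1 - alpha/2} * SE."""
    t_star = t_quantile(1 - alpha / 2, n - 2)
    return beta1_hat - t_star * se, beta1_hat + t_star * se

# Back-calculated inputs (assumptions, see the text above).
beta1_hat = 0.00955
se_beta1 = beta1_hat / 18.37
lower, upper = slope_ci(beta1_hat, se_beta1, n=55)
print(round(lower, 4), round(upper, 4))
```

Under these assumptions the result should agree with the reported interval up to rounding.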

Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
2. Confidence Intervals
Q: If CO2 were to increase by 1, how much would global temperatures increase on average?
A: (0.0085, 0.0106). We are 95% confident that a 1-unit increase in CO2 would increase global temperature by between 0.0085 and 0.0106 degrees, on average.

Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_1$:
2. Confidence Intervals
Q: What do we mean by confident?
A: If sampling were to be repeated, 95% of all such confidence intervals would contain the true increase in global temperatures.
Important Note: The uncertainty that you express here is only the uncertainty due to sampling variability. That is, the confidence intervals quantify the uncertainty that arises from basing conclusions about a population on a sample.

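Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_0$:
1. Hypothesis Testing
$t = \frac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} = \frac{\hat{\beta}_0 - \beta_0}{\hat{\sigma} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}} \sim T_{n-2}$
$H_0: \beta_0 = 0$ versus $H_a: \beta_0 < 0$, p-value $\approx 0$
2. Confidence Intervals
$\hat{\beta}_0 \pm t^{*}_{1-\alpha/2} SE(\hat{\beta}_0)$, which gives $(-3.45, -2.72)$
But this is hardly ever done because $\beta_0$ is hard to interpret!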

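Centered SLR Model
Issue: $\beta_0$ is (often) not interpretable.
Solution: Center the predictor (covariate):
$x_i^{*} = x_i - \bar{x}$
Then fit the model:
$y_i = \beta_0^{*} + \beta_1 x_i^{*} + \epsilon_i$
Least Squares Estimators:
$\hat{\beta}_0^{*} = \bar{y} - \hat{\beta}_1^{*} \bar{x}^{*} = \bar{y}$ (because $\bar{x}^{*} = 0$)
$\hat{\beta}_1^{*} = \frac{\sum_{i=1}^{n} (y_i - \bar{y}) x_i^{*}}{\sum_{i=1}^{n} (x_i^{*})^2} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \hat{\beta}_1$
$\hat{\sigma}^{*} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0^{*} - \hat{\beta}_1 x_i^{*})^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n-2}} = \hat{\sigma}$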

Centered SLR Model
Advantages of Centered Model:
1. Intercept is now interpretable: the predicted value of $y$ when $x_i = \bar{x}$ OR the average value of $y$.
2. The slope and variation parameter stay the same.
3. Slope and intercept are now independent: $\mathrm{Corr}(\hat{\beta}_0^{*}, \hat{\beta}_1) = 0$.
The Centered Model: $\hat{y} = 0.25 + 0.01 x^{*}$
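
The first two advantages are easy to check numerically: fitting the same responses against $x$ and against $x^{*} = x - \bar{x}$ should give the same slope, and the centered intercept should equal $\bar{y}$. The sketch below uses plain Python. The helper `ols` and the four data points are hypothetical, chosen only to resemble the scale of the plots; they are not the climate data.

```python
def ols(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical toy data (NOT the climate data set).
co2 = [320.0, 340.0, 360.0, 380.0]
temp = [-0.05, 0.15, 0.40, 0.55]

co2_bar = sum(co2) / len(co2)
co2_centered = [c - co2_bar for c in co2]

b0, b1 = ols(co2, temp)                       # uncentered fit
b0_star, b1_star = ols(co2_centered, temp)    # centered fit

print("slopes:", b1, b1_star)                 # same slope either way
print("centered intercept:", b0_star)         # equals the mean of temp
print("mean of temp:", sum(temp) / len(temp))
```

Only the intercept changes, because centering shifts where $x = 0$ sits without changing the line itself.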

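Accounting for Uncertainty in Parameters
Ways to Incorporate Uncertainty in Claims About $\beta_0^{*}$:
1. Hypothesis Testing
$t = \frac{\hat{\beta}_0^{*} - \beta_0^{*}}{SE(\hat{\beta}_0^{*})} = \frac{\hat{\beta}_0^{*} - \beta_0^{*}}{\hat{\sigma} / \sqrt{n}} \sim T_{n-2}$
$H_0: \beta_0^{*} = 0$ versus $H_a: \beta_0^{*} > 0$, p-value $\approx 0$
2. Confidence Intervals
$\hat{\beta}_0^{*} \pm t^{*}_{1-\alpha/2} SE(\hat{\beta}_0^{*})$, which gives (0.226, 0.276)
We are 95% confident that the average global temperature is between 0.226 and 0.276 degrees.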

Accounting for Uncertainty in Predictions
Q: Projections indicate an increase of CO2 into the future. What do you predict will be the global temperature for a CO2 level of 400?
Recall:
Uncentered: $\hat{y} = -3.08 + 0.01x$
Centered: $\hat{y} = 0.25 + 0.01x^{*}$
A: Our best guess is 0.729, but we are uncertain.
Note: If we took another sample, our prediction would change.
How do we incorporate sampling variability (uncertainty) into our predictions?

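Accounting for Uncertainty In Parameters
Ways to Incorporate Uncertainty in Predictions:
1. Confidence Intervals for the Mean
Stat 535 Result:
$\frac{\hat{\beta}_0 + \hat{\beta}_1 x - (\beta_0 + \beta_1 x)}{SE(\hat{\beta}_0 + \hat{\beta}_1 x)} \sim T_{n-2}$
so that
$\mathrm{Prob}\left( t^{*}_{\alpha/2} < \frac{\hat{\beta}_0 + \hat{\beta}_1 x - (\beta_0 + \beta_1 x)}{SE(\hat{\beta}_0 + \hat{\beta}_1 x)} < t^{*}_{1-\alpha/2} \right) = 1 - \alpha$,
where
$SE(\hat{\beta}_0 + \hat{\beta}_1 x) = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$,
which can be rearranged to:
$(\hat{\beta}_0 + \hat{\beta}_1 x) \pm t^{*}_{1-\alpha/2} SE(\hat{\beta}_0 + \hat{\beta}_1 x)$
Variability depends on where you are predicting.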

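Accounting for Uncertainty In Parameters
Ways to Incorporate Uncertainty in Predictions:
2. Prediction Intervals for Individuals
Stat 535 Result:
$\frac{\hat{y} - y}{SE(\hat{y})} \sim T_{n-2}$
so that
$\mathrm{Prob}\left( t^{*}_{\alpha/2} < \frac{\hat{y} - y}{SE(\hat{y})} < t^{*}_{1-\alpha/2} \right) = 1 - \alpha$,
where
$SE(\hat{y}) = \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$,
which can be rearranged to:
$\hat{y} \pm t^{*}_{1-\alpha/2} SE(\hat{y})$
Variability depends on where you are predicting.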

Accounting for Uncertainty in Predictions
Q: Projections indicate an increase of CO2 into the future. What do you predict will be the global temperature for a CO2 level of 400?
Recall:
Uncentered: $\hat{y} = -3.08 + 0.01x$
Centered: $\hat{y} = 0.25 + 0.01x^{*}$
Confidence Int. for Mean: $0.728 \pm 2.01(0.028) = (0.671, 0.786)$
Predictive Int.: $0.728 \pm 2.01(0.096) = (0.535, 0.922)$
A: We are 95% confident that if the CO2 level were 400, the global temperature would be between 0.535 and 0.922.
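
The two standard errors on this slide are linked: comparing the formulas for the mean and for an individual shows that $SE(\hat{y})^2 = \hat{\sigma}^2 + SE(\hat{\beta}_0 + \hat{\beta}_1 x)^2$. The sketch below checks this in plain Python, using only numbers reported in the slides ($\hat{\sigma} = 0.092$, mean SE 0.028, multiplier 2.01). The function names are illustrative. Because the inputs are rounded, the endpoints can differ from the slide's in the third decimal place.

```python
import math

def prediction_se(sigma_hat, se_mean):
    """SE for a new observation: sqrt(sigma_hat^2 + SE(mean)^2)."""
    return math.hypot(sigma_hat, se_mean)

def interval(center, t_star, se):
    """Symmetric interval: center +/- t_star * se."""
    return center - t_star * se, center + t_star * se

sigma_hat = 0.092   # residual standard error reported for the fit
se_mean = 0.028     # reported SE of the estimated mean at CO2 = 400
y_hat = 0.728       # reported point prediction at CO2 = 400
t_star = 2.01       # reported 97.5% t quantile (53 degrees of freedom)

se_pred = prediction_se(sigma_hat, se_mean)
ci_mean = interval(y_hat, t_star, se_mean)
pred_int = interval(y_hat, t_star, se_pred)

print("SE for prediction:", round(se_pred, 3))
print("CI for mean:", ci_mean)
print("Prediction interval:", pred_int)
```

The prediction interval is always the wider of the two, because it adds the variability of an individual year ($\hat{\sigma}^2$) to the uncertainty in the estimated mean.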

Accounting for Uncertainty in Predictions
Q: Projections indicate an increase of CO2 into the future. What do you predict will be the global temperature for a CO2 level of 400?
Recall:
Uncentered: $\hat{y} = -3.08 + 0.01x$
Centered: $\hat{y} = 0.25 + 0.01x^{*}$
[Figure: Temperature versus CO2, showing the prediction line, the prediction interval, and the confidence interval.]

Cross-Validation Revisited
When we perform cross-validation, we are used to calculating:
1. Bias
2. RPMSE
But we should also generate a prediction interval for each test observation and calculate the following:
1. Coverage = % of prediction intervals that contain the true value
2. Predictive Interval Width = average width of the prediction intervals
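
The four cross-validation summaries can be computed from the test-set values, their predictions, and the interval endpoints. The sketch below uses plain Python and assumes bias is defined as the average of prediction minus truth; the slides do not state the sign convention. The function name `cv_metrics` and the example numbers are made up for illustration.

```python
import math

def cv_metrics(truth, pred, lower, upper):
    """Bias, RPMSE, coverage, and average interval width on a test set."""
    n = len(truth)
    errors = [p - t for p, t in zip(pred, truth)]   # prediction minus truth
    bias = sum(errors) / n
    rpmse = math.sqrt(sum(e * e for e in errors) / n)
    covered = sum(1 for t, lo, hi in zip(truth, lower, upper) if lo <= t <= hi)
    coverage = covered / n                          # proportion, not percent
    avg_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return {"bias": bias, "rpmse": rpmse,
            "coverage": coverage, "width": avg_width}

# Hypothetical test-set values, for illustration only.
truth = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 5.0]
lower = [0.5, 1.5, 1.0, 4.5]
upper = [2.5, 2.5, 2.9, 5.5]

print(cv_metrics(truth, pred, lower, upper))
```

For 95% prediction intervals, coverage near 0.95 suggests the intervals are calibrated. Much lower coverage suggests they are too narrow.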

Regression Assumptions Revisited
What happens if our assumptions aren't met:
Linearity: if non-linear, everything breaks! Don't fit a line to non-linear data!
Independence: estimates are still unbiased (i.e. we fit the right line), but measures of the accuracy of those estimates (the standard errors) are typically too small.
Normality: estimates are still unbiased (i.e. we fit the right line) and standard errors are correct, BUT confidence/prediction intervals are wrong (can't use the t-distribution).
Equal variance: estimates are still unbiased, but standard errors are wrong (and we don't know how wrong).

End of Climate Analysis (see webpage for R and SAS code)