Discussion # 6, Water Quality and Mercury in Fish

Solution: Discussion #6, Water Quality and Mercury in Fish

Summary

Approach

The purpose of the analysis was somewhat ambiguous: an analysis to determine which of the explanatory variables appear to influence the response variable might differ somewhat from an analysis to develop a predictive model, since the former focuses on conclusions for individual variables while the latter doesn't. In neither case, though, is any of the explanatory variables of special a priori interest, so I think the appropriate method of analysis is model selection rather than hypothesis testing.

Data splitting?

Because the number of possible predictor variables is small relative to the number of observations, the data could be split into model-building and validation subsets without violating the guidelines concerning the ratio of observations to variables, especially if the split were unequal (e.g. randomly choosing most of the observations for model selection and leaving the rest for validation). I did not do this, however, preferring to rely on PRESS for internal validation rather than using a small subset for external validation.

Data Manipulations and Model Diagnostics

Transformation (log or something similar) of all the explanatory variables except pH is useful to reduce the leverage of the observations with large values of these variables and to produce straighter relationships. Square-root transformation of the mercury variable makes the variability of the residuals more even, but the unevenness without this transformation is mild, so I think the transformation is acceptable but not necessary. The conclusions are not greatly affected by this transformation. Examination of residual plots shows no important problems with the full or reduced models, apart from those resolved by these transformations of the variables.

Results and Conclusion

Alkalinity (log transformed) clearly is the single most useful water-quality variable for predicting the mean mercury level of a lake's bass. The model with only log alkalinity is best by all criteria except Cp if mercury is not transformed, and best by SBC (= BIC) if mercury is square-root transformed. Using chlorophyll (also log transformed) in addition to alkalinity may be slightly better than using alkalinity alone: this two-variable model is best by Cp, AICc, and PRESS if mercury is transformed, and second-best by PRESS if mercury is not transformed.
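PRESS, relied on above for internal validation, sums the squared errors made when each observation is predicted from a model fitted without it. The following is a minimal sketch of that idea for a one-predictor model such as the log-alkalinity model, not the code used for this analysis. The function names and the data values are made up for illustration. It computes PRESS two ways: by the usual leverage shortcut, e_i / (1 - h_ii), and by actually refitting with each observation left out.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def press_shortcut(x, y):
    """PRESS from ordinary residuals and leverages: sum of (e_i / (1 - h_ii))^2."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx   # leverage of this observation
        e = yi - (b0 + b1 * xi)                # ordinary residual
        total += (e / (1.0 - h)) ** 2
    return total


def press_refit(x, y):
    """PRESS by brute force: drop each observation, refit, predict it."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Made-up values standing in for log-alkalinity (x) and mercury (y).
x = [0.5, 1.2, 1.9, 2.4, 3.1, 3.8, 4.4]
y = [1.20, 0.95, 0.80, 0.55, 0.60, 0.30, 0.25]

print("PRESS (leverage shortcut):", press_shortcut(x, y))
print("PRESS (leave-one-out refits):", press_refit(x, y))
```

The two calculations agree up to rounding, which is why PRESS can be obtained from a single fit rather than from one refit per observation.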

Preliminary Data Exploration

Of the four explanatory variables, all but pH have skewed distributions (long upper tails); observations in the tails of these distributions could have high leverage. Log transformations eliminate this skew, and indeed log-alkalinity is somewhat skewed in the opposite direction.

[Figure: distributions of alk, ph, cal, chl, lncal, and lnchl.]

As the scatterplot matrix below shows, all the variables are fairly strongly associated, positively so among the explanatory variables and negatively between them and the response variable, mercury. Most of these bivariate relationships, however, are strongly curvilinear, and strongly dominated by observations in the right tails of the skewed distributions. There also is one aberrant lake (observation #9, shown by the black square) with a high level of mercury despite high levels of alkalinity and calcium.

[Figure: scatterplot matrix of mercury, alk, ph, cal, and chl.]
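The leverage concern can be illustrated with a small calculation. In a one-predictor regression the leverage of an observation is 1/n plus its squared distance from the mean of the predictor, scaled by the total sum of squares, so a value far out in a long upper tail gets a large leverage. The sketch below uses made-up, right-skewed values (not the lake data) and a hypothetical helper `leverages` to compare the largest leverage before and after a log transformation.

```python
import math


def leverages(x):
    """Leverage h_ii of each observation in a simple linear regression on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


# Made-up, right-skewed values standing in for a predictor such as alkalinity.
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 8.0, 12.0, 20.0, 45.0, 120.0]
logged = [math.log(v) for v in raw]

h_raw = leverages(raw)
h_log = leverages(logged)

print("largest leverage, raw scale:", round(max(h_raw), 3))
print("largest leverage, log scale:", round(max(h_log), 3))
```

On the raw scale the single largest value dominates the fit; on the log scale its leverage is much closer to that of the other observations.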

Log transformations largely straighten out these relationships and reduce the likely leverage of the observations with high values of the predictor variables; by doing so, they also make observation 9 less unusual.

[Figure: scatterplot matrix of mercury and the explanatory variables, all but ph log transformed (log-alkalinity, ph, lncal, lnchl).]

Conclusion from data exploration

Because of concerns about both nonlinearity and leverage, I think it would be preferable to work with the transformed variables. In the following I show results using log transformations of the three variables (all but pH); similar results would be obtained if, for instance, alkalinity were square-root transformed.

Diagnostics for Maximum Model

The basic residual plots, as well as the added-variable (= leverage = partial-regression) plots, for the maximum model (all four possible predictor variables, all but pH log transformed) are shown below. They generally are acceptable. There does appear to be greater variability in the residuals at larger values of predicted mercury, and the distribution of the residuals is slightly skewed (long right tail). I don't feel either of these problems is severe enough to invalidate analysis using this model, but square-root transformation of the mercury variable does somewhat lessen both these concerns, as shown in the second set of residual and added-variable plots.

[Figure: mercury vs. all four variables, all but ph log transformed. Residual plots: normal probability plot of the residuals, residuals versus the fitted values, histogram of the residuals, residuals versus the order of the data. Partial regression plots of mercury vs. log-alkalinity, ph, lncal, and lnchl, each annotated with the estimated slope of the least-squares line.]

[Figure: sqrt(mercury) vs. all four variables, all but ph log transformed. Residual plots: normal probability plot, residuals versus fits, histogram, residuals versus order. Partial regression plots of sqrtmerc vs. log-alkalinity, ph, lncal, and lnchl, each annotated with the estimated slope of the least-squares line.]

I think analysis using either mercury or square-root transformed mercury is acceptable, and will show results for both in the following.

A Note on AICc and SBC Values

There are several ways to calculate AIC, AICc, and SBC (aka BIC). One difference is whether to include the term n ln n. Because this term is identical for all models (for a given data set), including it or not has no effect on comparisons among models, but it does cause the values reported by different programs to differ. A more consequential difference is that some versions include σ in the count of parameters being estimated (giving a total count of p + 1), while others count only the βs (for a count of p). This affects the 2p or [ln n]p terms in the formulae: if σ is counted, these terms become 2(p+1) and [ln n](p+1). When p is small, the difference between these versions can be substantial, altering the comparisons among models of differing sizes. The text uses p, while JMP apparently uses p+1. In R, AIC uses p+1 while extractAIC uses p. In the following I show values computed using the formulae in the text and that I gave in lecture (i.e. including n ln n and using p rather than p+1 as the number of parameters). I don't think for these data that different versions of the criteria will give different conclusions.

Untransformed Mercury

Model Selection

[Table: all-subsets results for untransformed mercury. Columns: Vars (number of predictors), Cp, AICc, BIC, PRESS, and the variables in each model (log-alkalinity, ph, lncal, lnchlor, and their combinations up to all four).]

To facilitate comparison, these criteria are plotted against p in the following.

[Figure: Cp, AICc, BIC, and PRESS for untransformed mercury, each plotted against p, with points labeled by the variable added (+ph, +lncal, +lnchl).]

By AICc, BIC (= SBC), and PRESS, the model with only log-transformed alkalinity is best. The model with log-alkalinity and log-calcium is the smallest model to have Cp near p, and so would be selected by that criterion. This model also has AICc nearly as small as for the best model. Interestingly, though, by PRESS this model is worse than the other two-variable models combining either log-chlorophyll or pH with log-alkalinity.
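The note on AICc and SBC conventions can be made concrete with a short calculation. This is a sketch under stated assumptions, not the formulae from the text or lecture verbatim: it takes AIC = n ln SSE - n ln n + 2k and SBC = n ln SSE - n ln n + [ln n]k, with k equal to p or p+1, and it assumes the common small-sample correction AICc = AIC + 2k(k+1)/(n-k-1). The function name, its arguments, and the SSE values are made up for illustration.

```python
import math


def selection_criteria(sse, n, p, count_sigma=False, include_n_ln_n=True):
    """AIC, AICc and SBC for a least-squares fit with error sum of squares sse.

    p counts the regression coefficients (the betas, intercept included).
    count_sigma=True uses p + 1 parameters instead of p.
    """
    k = p + 1 if count_sigma else p
    fit = n * math.log(sse)
    if include_n_ln_n:
        fit -= n * math.log(n)
    aic = fit + 2 * k
    # Assumed form of the small-sample correction; programs may differ.
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    sbc = fit + math.log(n) * k
    return {"AIC": aic, "AICc": aicc, "SBC": sbc}


# Made-up values: two nested models fitted to the same n observations.
n = 50
small = dict(sse=4.0, p=2)
large = dict(sse=3.8, p=3)

for count_sigma in (False, True):
    a = selection_criteria(n=n, count_sigma=count_sigma, **small)
    b = selection_criteria(n=n, count_sigma=count_sigma, **large)
    gaps = {name: round(a[name] - b[name], 4) for name in a}
    print("count_sigma =", count_sigma, "small minus large:", gaps)
```

Under these assumed formulae, dropping n ln n or counting σ shifts AIC and SBC by the same amount for every model, so the gaps between models are unchanged. The AICc correction is not linear in the parameter count, so it is through AICc that the choice between p and p+1 can change how models of different sizes compare.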

There is much in common among the best models. Log-alkalinity is in every one of them, and is the only variable which by itself constitutes a good model. Combining log-alkalinity with either or both of log-calcium and log-chlorophyll gives good models, though whether they are better or worse than the model with only log-alkalinity depends on the criterion, as does the relative performance of these three models.

Diagnostic evaluations

log-alkalinity only

The scatterplot of mercury vs. log-alkalinity shows a fairly linear relationship, with one point (blue diamond) somewhat to the left of the main cloud of points and thus having moderate leverage, and one point (observation 9; black square) quite far above the trend near the right side, with fairly high alkalinity and fairly high mercury.

[Figure: scatterplot of mercury vs. log(alkalinity).]

The plot of residuals vs. fits is quite straight and featureless, apart from one high outlier. The observation with unusually low alkalinity accordingly has an unusually high predicted level of mercury, but it fits the trend well and so has a small residual and presumably little influence. Interestingly, the uneven variance of the residuals seen for the full model is not apparent for this reduced model. The distribution of the residuals is fairly skewed, but with a sample of this size this is not a major problem.

[Figure: residual plots for mercury vs. log-alkalinity: normal probability plot of the residuals, residuals versus the fitted values, histogram of the residuals, residuals versus the order of the data.]

Larger models

Residual plots for two other good models, with either log-calcium or log-chlorophyll added to log-alkalinity, are quite similar to those for the single-variable model above. When log-chlorophyll is included the distribution of residuals is closer to Normal, but there is a somewhat stronger pattern of increasing variability with larger values of predicted mercury. Conversely, the model combining log-alkalinity with log-calcium has slightly more even variability but a less Normal distribution. In all models observation 9 is an outlier with a large positive residual, and one lake consistently has the highest predicted level of mercury.

[Figure: residual plots for mercury vs. log-alkalinity + log-calcium and for mercury vs. log-alkalinity + log-chlorophyll: normal probability plot of the residuals, residuals versus the fitted values, histogram of the residuals, residuals versus the order of the data.]

Conclusion from diagnostics

I see no serious problems with any of these models. I therefore also see no reason to consider any of these models more or less appropriate than any of the others, and thus no reason to prefer any of the larger models over the simple single-variable model with log-alkalinity.

Square-root Transformed Mercury

Model Selection

[Table: all-subsets results for square-root transformed mercury. Columns: Vars (number of predictors), Cp, AICc, BIC, PRESS, and the variables in each model (log-alkalinity, ph, lncal, lnchlor, and their combinations up to all four).]

These criteria are plotted against p in the figure below. The model with log-alkalinity and log-chlorophyll is best by Cp (it has the smallest Cp as well as being the smallest model with Cp near p), as well as by AICc and PRESS. The model with only log-alkalinity is best by BIC and second-best by AICc and PRESS.
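The "Cp near p" rule used above can be sketched as follows. This assumes the usual form of Mallows' criterion, Cp = SSE_p / MSE_full - (n - 2p), where p counts the parameters of the subset model (intercept included) and MSE_full comes from the model with all candidate predictors. The function name and all numbers are made up for illustration and are not the values from this analysis.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp for a subset model with p parameters (intercept included)."""
    return sse_p / mse_full - (n - 2 * p)


# Made-up numbers: n observations and a full model with 5 parameters.
n = 50
p_full = 5
sse_full = 3.6
mse_full = sse_full / (n - p_full)

# Hypothetical subset models: (label, number of parameters, SSE).
candidates = [
    ("one predictor", 2, 4.1),
    ("two predictors", 3, 3.75),
    ("all four predictors", p_full, sse_full),
]

for label, p, sse in candidates:
    cp = mallows_cp(sse, mse_full, n, p)
    print(label, "p =", p, "Cp =", round(cp, 2))
```

For the full model Cp equals p by construction, so the criterion is informative only for the smaller models: a subset model with Cp close to p shows little evidence of bias from the omitted variables, and among such models a small Cp is preferred.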

[Figure: Cp, AICc, BIC, and PRESS for sqrt(mercury) (sqrtm_cp, sqrtm_aicc, sqrtm_bic, sqrtm_press), each plotted against p, with points labeled by the variable added (+ph, +lncal, +lnchl).]

As was seen above for untransformed mercury, the model with log-alkalinity and log-calcium was the second best fitting two-variable model (by R² and thus Cp, AICc, and BIC), but was somewhat worse by the PRESS criterion than the model with log-alkalinity and pH. There again is much in common among the good models: all include log-alkalinity, either alone or with one or both of log-calcium and log-chlorophyll.

Diagnostic evaluations

log-alkalinity + log-chlorophyll

Residual plots for this model show no substantial problems, except that yet again observation 9 is a moderately high outlier. The distribution of residuals, while skewed, is less so than for the models above using untransformed mercury.

[Figure: residual plots for square-root(mercury) vs. log-alkalinity + log-chlorophyll: normal probability plot, residuals versus fits, histogram, residuals versus order.]

log-alkalinity only

The scatterplot of square-root-mercury vs. log-alkalinity is quite similar to that shown above for untransformed mercury, showing a fairly linear relationship with one point (blue diamond) somewhat to the left of the main cloud of points and one point (observation 9; black square) quite far above the trend near the right side, with fairly high alkalinity and fairly high mercury.

[Figure: scatterplot of square-root(mercury) vs. log(alkalinity).]

Residual plots for this model are quite similar to those just above for the model relating square-root-mercury to log-alkalinity and log-chlorophyll. There again is the one high outlier (observation 9) but no other apparent problems.

[Figure: residual plots for square-root(mercury) vs. log(alkalinity): normal probability plot, residuals versus fits, histogram, residuals versus order.]

Conclusion from diagnostics

I again see no serious problems with either of these models, and so no basis for choosing between them based on assumptions/diagnostics.

Overall Conclusion

The best models for explaining/predicting mercury levels in the fish use either log-alkalinity alone or log-alkalinity and log-chlorophyll together. Of the various models considered, I would choose the one using square-root-transformed mercury and both log-alkalinity and log-chlorophyll as predictors, since this is the best model for square-root-mercury by PRESS (my favorite criterion), and the models for square-root-mercury have somewhat larger R² than those for untransformed mercury.