Discussion # 6, Water Quality and Mercury in Fish


Solution: Discussion #6, Water Quality and Mercury in Fish

Summary

Approach
The purpose of the analysis was somewhat ambiguous: an analysis to determine which of the explanatory variables appear to influence the response variable might differ somewhat from an analysis to develop a predictive model, since the former focuses on conclusions about individual variables while the latter doesn't. In neither case, though, is any of the explanatory variables of special a priori interest, so I think the appropriate method of analysis is model selection rather than hypothesis testing.

Data splitting?
Because the number of possible predictor variables is small relative to the number of observations, the data could be split into model-building and validation subsets without violating the guidelines concerning the ratio of observations to variables, especially if the split were unequal (e.g. randomly choosing the larger share of the observations for model selection and leaving the rest for validation). I did not do this, however, preferring to rely on PRESS for internal validation rather than using a small subset for external validation.

Data Manipulations and Model Diagnostics
Transformation (log or something similar) of all the explanatory variables except pH is useful to reduce the leverage of the observations with large values of these variables and to produce straighter relationships. Square-root transformation of the mercury variable makes the variability of the residuals more even, but the unevenness without this transformation is mild, so I think the transformation is acceptable but not necessary. The conclusions are not greatly affected by this transformation. Examination of residual plots shows no important problems with the full or reduced models, apart from those resolved by these transformations of the variables.

Results and Conclusion
Alkalinity (log transformed) clearly is the single most useful water-quality variable for predicting the mean mercury level of a lake's bass. The model with only log-alkalinity is best by all criteria except Cp if mercury is not transformed, and best by SBC (= BIC) if mercury is square-root transformed. Using chlorophyll (also log transformed) in addition to alkalinity may be slightly better than using alkalinity alone: this two-variable model is best by Cp, AICc, and PRESS if mercury is transformed, and second-best by PRESS if mercury is not transformed.

Preliminary Data Exploration
Of the four explanatory variables, all but pH have skewed distributions (long upper tails); observations in the tails of these distributions could have high leverage. Log transformations eliminate this skew, and indeed log-alkalinity is somewhat skewed in the opposite direction.

[Table: alk, ph, cal, chl, lncal, lnchl; numeric values not legible in this copy]

As the scatterplot matrix below shows, all the variables are fairly strongly associated, positively so among the explanatory variables and negatively between them and the response variable, mercury. Most of these bivariate relationships, however, are strongly curvilinear and strongly dominated by observations in the right tails of the skewed distributions. There also is one aberrant observation (the lake recorded as observation #9, shown by the black square) with a high level of mercury despite high levels of alkalinity and calcium.

[Figure: scatterplot matrix of mercury, alk, ph, cal, and chl]
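The leverage concern can be illustrated with a small sketch. The data are synthetic (an evenly spaced grid that is exponentiated to give a long upper tail), not the lake measurements, and the helper name `leverages` is made up for this illustration. For a regression on a single predictor, the leverage of observation i is 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2, so a point far out in the tail receives a large value.

```python
import numpy as np

def leverages(x):
    """Hat-matrix diagonals for a straight-line regression on one predictor."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / x.size + dev**2 / np.sum(dev**2)

# Synthetic right-skewed predictor (an assumption for illustration only).
log_scale = np.linspace(0.0, 5.0, 50)
skewed = np.exp(log_scale)

h_raw = leverages(skewed)           # predictor on its original scale
h_log = leverages(np.log(skewed))   # predictor after log transformation

print("largest leverage, raw scale:", h_raw.max())
print("largest leverage, log scale:", h_log.max())
print("sum of leverages (equals 2 parameters):", h_raw.sum(), h_log.sum())
```

On data shaped like this, the largest leverage belongs to the observation at the end of the upper tail, and it shrinks once the predictor is log transformed.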

Log transformations largely straighten out these relationships and reduce the likely leverage of the observations with high values of the predictor variables; by doing so, they also make observation 9 less unusual.

[Figure: scatterplot matrix of mercury, log-alkalinity, ph, lncal, and lnchl]

Conclusion from data exploration
Because of concerns about both nonlinearity and leverage, I think it would be preferable to work with the transformed variables. In the following I show results using log transformations of the three variables (all but pH); similar results would be obtained if, for instance, alkalinity were square-root transformed.

Diagnostics for Maximum Model
The basic residual plots, as well as the added-variable (= leverage = partial-regression) plots, for the maximum model including all four possible predictor variables (all but pH having been log transformed) are shown below. They generally are acceptable. There does appear to be greater variability in the residuals at larger values of predicted mercury, and the distribution of the residuals is slightly skewed (long right tail). I don't feel either of these problems is severe enough to invalidate analysis using this model, but square-root transformation of the mercury variable does somewhat lessen both concerns, as shown in the second set of residual and added-variable plots further below.

[Figure: residual and added-variable plots for mercury vs. all four variables, all but pH log transformed. Residual panels: normal probability plot of the residuals, residuals versus the fitted values, histogram of the residuals, residuals versus the order of the data. Partial regression panels: mercury vs. log-alkalinity, ph, lncal, and lnchl, each annotated with the estimated slope of the least-squares line.]
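For readers unfamiliar with partial regression (added-variable) plots, the following sketch shows how the plotted coordinates and the "estimated slope of the least squares line" arise. It runs on synthetic data rather than the lake data, and the function names `ols_residuals` and `added_variable` are hypothetical. Each plot regresses both the response and one predictor on all the other predictors and then plots the two sets of residuals against each other; the slope of that plot equals the predictor's coefficient in the full model.

```python
import numpy as np

def ols_residuals(y, X):
    """Residuals from the least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def added_variable(y, X, j):
    """Coordinates and slope of the added-variable plot for column j of X.

    X must already contain the intercept column.
    """
    others = np.delete(X, j, axis=1)
    y_res = ols_residuals(y, others)         # response adjusted for the others
    x_res = ols_residuals(X[:, j], others)   # predictor j adjusted for the others
    slope = (x_res @ y_res) / (x_res @ x_res)
    return x_res, y_res, slope

# Synthetic data (an assumption for illustration): intercept plus three predictors.
rng = np.random.default_rng(1)
n = 40
Z = rng.normal(size=(n, 3))
Z[:, 1] += 0.7 * Z[:, 0]                     # make two predictors correlated
X = np.column_stack([np.ones(n), Z])
y = 1.0 - 0.5 * Z[:, 0] + 0.2 * Z[:, 2] + rng.normal(scale=0.3, size=n)

full_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
x_res, y_res, av_slope = added_variable(y, X, j=1)
print("added-variable slope:", av_slope)
print("full-model coefficient:", full_coef[1])
```

Plotting `y_res` against `x_res` gives the panel itself; points far from the rest along the horizontal axis are the ones with high leverage for that coefficient.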

[Figure: residual and added-variable plots for sqrt(mercury) vs. all four variables, all but pH log transformed. Residual panels: normal probability plot, versus fits, histogram, versus order. Partial regression panels: sqrtmerc vs. log-alkalinity, ph, lncal, and lnchl, each annotated with the estimated slope of the least-squares line.]

I think analysis using either mercury or square-root transformed mercury is acceptable, and will show results for both in the following.

A Note on AICc and SBC Values
There are several ways to calculate AIC, AICc, and SBC (aka BIC). One difference is whether to include the term n ln n. Because this term is identical for all models (for a given data set), including it or not has no effect on comparisons among models, but does cause the values reported by different programs to differ. A more consequential difference is that some versions include σ in the count of parameters being estimated (giving a total count of p + 1), while others count only the βs (for a count of p). This affects the 2p or [ln n] p terms in the formulae: if σ is counted, these terms become 2(p + 1) and [ln n](p + 1). When p is small, the difference between these versions can be substantial, altering the comparisons among models of differing sizes. The text uses p, while JMP apparently uses p + 1. In R, AIC uses p + 1 while extractAIC uses p. In the following I show values computed using the formulae in the text and that I gave in lecture (i.e. including n ln n and using p rather than p + 1 as the number of parameters). I don't think that, for these data, different versions of the criteria will give different conclusions.

Untransformed Mercury

Model Selection
[Table: best-subsets summary with columns Vars, Cp, AICc, BIC, PRESS, and the variables in each model, covering the combinations of log-alkalinity, ph, lncal, and lnchlor; numeric values not legible in this copy]

To facilitate comparison, these criteria are plotted against p in the following.

[Figure: Cp, AICc, BIC, and PRESS plotted against p for untransformed mercury, with points labeled +ph, +lncal, and +lnchl]

By AICc, BIC (= SBC), and PRESS, the model with only log-transformed alkalinity is best. The model with log-alkalinity and log-calcium is the smallest model to have Cp near p, and so would be selected by that criterion. This model also has AICc nearly as small as that of the best model. Interestingly, though, by PRESS this model is worse than the other two-variable models combining either log-chlorophyll or pH with log-alkalinity.
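The criteria discussed above can be computed for every subset of predictors with a short sketch. Everything here is illustrative: the data are synthetic, and the names `sse_and_press`, `criteria`, and `count_sigma` are hypothetical. The AIC and BIC forms follow the description in the note (n ln SSE - n ln n plus a penalty based on the parameter count), Cp is taken as SSE_p / MSE_full - (n - 2p), and PRESS uses the leverage shortcut e_i / (1 - h_ii). The small-sample correction 2k(k + 1)/(n - k - 1) added for AICc is an assumption, since the lecture formula is not reproduced in this document.

```python
import itertools
import numpy as np

def sse_and_press(y, X):
    """Error sum of squares and PRESS for the least-squares fit of y on X."""
    Q, _ = np.linalg.qr(X)                  # orthonormal basis for the columns of X
    h = np.sum(Q**2, axis=1)                # leverages (hat-matrix diagonal)
    resid = y - Q @ (Q.T @ y)
    return resid @ resid, np.sum((resid / (1.0 - h))**2)

def criteria(y, X, mse_full, count_sigma=False):
    """Cp, AICc, BIC, and PRESS; X must contain the intercept column."""
    n, p = X.shape
    k = p + 1 if count_sigma else p         # optionally count sigma as a parameter
    sse, press = sse_and_press(y, X)
    base = n * np.log(sse) - n * np.log(n)
    return {
        "p": p,
        "Cp": sse / mse_full - (n - 2 * p),
        "AICc": base + 2 * k + 2 * k * (k + 1) / (n - k - 1),  # assumed correction
        "BIC": base + np.log(n) * k,
        "PRESS": press,
    }

# Synthetic data (an assumption for illustration): four candidate predictors.
rng = np.random.default_rng(2)
n = 50
names = ["x1", "x2", "x3", "x4"]
Z = rng.normal(size=(n, 4))
y = 2.0 - 0.8 * Z[:, 0] + 0.3 * Z[:, 2] + rng.normal(scale=0.5, size=n)
X_full = np.column_stack([np.ones(n), Z])
sse_full, _ = sse_and_press(y, X_full)
mse_full = sse_full / (n - X_full.shape[1])

results = {}
for size in range(5):
    for subset in itertools.combinations(range(4), size):
        cols = [0] + [c + 1 for c in subset]
        label = ", ".join(names[c] for c in subset) or "(intercept only)"
        results[label] = criteria(y, X_full[:, cols], mse_full)

for label, row in sorted(results.items(), key=lambda kv: kv[1]["PRESS"]):
    print(f"{label:20s} p={row['p']}  Cp={row['Cp']:8.2f}  AICc={row['AICc']:8.2f}  "
          f"BIC={row['BIC']:8.2f}  PRESS={row['PRESS']:7.3f}")
```

Under these particular formulae, setting `count_sigma=True` shifts BIC by the same amount (ln n) for every model, while the AICc correction term changes by an amount that depends on model size, so it is the AICc rankings that can differ between the two conventions.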

There is much in common among the best models. Log-alkalinity is in every one of them, and is the only variable which by itself constitutes a good model. Combining log-alkalinity with either or both of log-calcium and log-chlorophyll gives good models, though whether they are better or worse than the model with only log-alkalinity depends on the criterion, as does the relative performance of these three models.

Diagnostic evaluations

log-alkalinity only
The scatterplot of mercury vs. log-alkalinity shows a fairly linear relationship, with one point (blue diamond) somewhat to the left of the main cloud of points and thus having moderate leverage, and one point (observation 9; black square) quite far above the trend near the right side, with fairly high alkalinity and fairly high mercury.

[Figure: scatterplot of mercury vs. log(alkalinity)]

The plot of residuals vs. fits is quite straight and featureless, apart from one high outlier. The observation with unusually low alkalinity accordingly has an unusually high predicted level of mercury, but it fits the trend well and so has a small residual and presumably little influence. Interestingly, the uneven variance of the residuals seen for the full model is not apparent for this reduced model. The distribution of the residuals is fairly skewed, but with this sample size that is not a major problem.

[Figure: residual plots for mercury vs. log-alkalinity: normal probability plot of the residuals, residuals versus the fitted values, histogram of the residuals, residuals versus the order of the data]

Larger models
Residual plots for two other good models, with either log-calcium or log-chlorophyll added to log-alkalinity, are quite similar to those for the single-variable model above. When log-chlorophyll is included the distribution of residuals is closer to Normal, but there is a somewhat stronger pattern of increasing variability with larger values of predicted mercury. Conversely, the model combining log-alkalinity with log-calcium has slightly more even variability but a less Normal distribution. In all models observation 9 is an outlier with a large positive residual, and the same lake has the highest predicted level of mercury in each.

[Figure: residual plots (normal probability plot, versus fits, histogram, versus order) for mercury vs. log-alkalinity + log-calcium and for mercury vs. log-alkalinity + log-chlorophyll]

Conclusion from diagnostics
I see no serious problems with any of these models. I therefore also see no reason to consider any of these models more or less appropriate than any of the others, and thus no reason to prefer any of the larger models over the simple single-variable model with log-alkalinity.

Square-root Transformed Mercury

Model Selection
[Table: best-subsets summary with columns Vars, Cp, AICc, BIC, PRESS, and the variables in each model, covering the combinations of log-alkalinity, ph, lncal, and lnchlor; numeric values not legible in this copy]

These criteria are plotted against p in the figure below. The model with log-alkalinity and log-chlorophyll is best by Cp (it has the smallest Cp as well as being the smallest model with Cp near p), as well as by AICc and PRESS. The model with only log-alkalinity is best by BIC and second-best by AICc and PRESS.

[Figure: Cp, AICc, BIC, and PRESS plotted against p for sqrt(mercury), with points labeled + ph, + lncal, and + lnchl]

As was seen above for untransformed mercury, the model with log-alkalinity and log-calcium was the second-best-fitting two-variable model (by R² and thus Cp, AICc, and BIC), but was somewhat worse by the PRESS criterion than the model with log-alkalinity and pH. There again is much in common among the good models: all include log-alkalinity, either alone or with one or both of log-calcium and log-chlorophyll.

Diagnostic evaluations

log-alkalinity + log-chlorophyll
Residual plots for this model show no substantial problems, except that yet again observation 9 is a moderately high outlier. The distribution of residuals, while skewed, is less so than for the models above using untransformed mercury.

[Figure: residual plots for square-root(mercury) vs. log-alkalinity + log-chlorophyll: normal probability plot, versus fits, histogram, versus order]
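An outlier such as the one that recurs in these residual plots can also be screened numerically. The sketch below computes studentized deleted residuals on synthetic data with one planted outlier; it is not the procedure used in this solution, and the function name `deleted_studentized` and the planted position are assumptions made for illustration. The formula t_i = e_i * sqrt((n - p - 1) / (SSE (1 - h_ii) - e_i^2)) gives each residual scaled by an error estimate that excludes the observation itself.

```python
import numpy as np

def deleted_studentized(y, X):
    """Studentized deleted residuals; X must contain the intercept column."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                 # leverages
    e = y - Q @ (Q.T @ y)                    # ordinary residuals
    sse = e @ e
    return e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))

# Synthetic data (an assumption for illustration): a declining trend
# with one observation shifted well above it.
rng = np.random.default_rng(3)
n = 30
x = np.linspace(0.0, 5.0, n)
y = 1.2 - 0.15 * x + rng.normal(scale=0.05, size=n)
planted = 24
y[planted] += 0.6

X = np.column_stack([np.ones(n), x])
t = deleted_studentized(y, X)
flagged = int(np.argmax(np.abs(t)))
print("most extreme observation:", flagged, "t =", t[flagged])
```

A large positive value flags a point lying well above the fitted trend; whether such a point matters for the conclusions still depends on its leverage and influence, as discussed in the text.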

log-alkalinity only
The scatterplot of square-root-mercury vs. log-alkalinity is quite similar to that shown above for untransformed mercury, showing a fairly linear relationship with one point (blue diamond) somewhat to the left of the main cloud of points and one point (observation 9; black square) quite far above the trend near the right side, with fairly high alkalinity and fairly high mercury.

[Figure: scatterplot of square-root(mercury) vs. log(alkalinity)]

Residual plots for this model are quite similar to those just above for the model relating square-root-mercury to log-alkalinity and log-chlorophyll. There again is the one high outlier (observation 9) but no other apparent problems.

[Figure: residual plots for square-root(mercury) vs. log(alkalinity): normal probability plot, versus fits, histogram, versus order]

Conclusion from diagnostics
I again see no serious problems with either of these models, so there is no basis for choosing between them based on assumptions or diagnostics.

Overall Conclusion
Either log-alkalinity alone, or log-alkalinity and log-chlorophyll together, are the best models for explaining or predicting mercury levels in the fish. Of the various models considered, I would choose the one using square-root-transformed mercury and both log-alkalinity and log-chlorophyll as predictors, since this is the best model for square-root-mercury by PRESS (my favorite criterion), and the models for square-root-mercury have somewhat larger R² than those for untransformed mercury.