Statistics GIDP Ph.D. Qualifying Exam Methodology
May 28th, 2015, 9:00am-1:00pm

Instructions: Provide answers on the supplied pads of paper; write on only one side of each sheet. Complete exactly 2 of the first 3 problems, and 2 of the last 3 problems. Turn in only those sheets you wish to have graded. You may use the computer and/or a calculator; any statistical tables that you may need are also provided. Stay calm and do your best; good luck.

1. A dataset is observed based on an experiment to compare three different brands of pens and three different wash treatments with respect to their ability to remove marks on a particular type of fabric. There are four replications in each combination of brands and treatments. The observation averages within each combination are given in the following table.

1) Consider the means model y_ijk = μ_ij + ε_ijk, where ε_ijk ~ N(0, σ²). MSE is 0.2438. Compute the estimates of the μ_ij along with their standard errors.

The estimated cell means are

μ̂_11 = 5.11, μ̂_12 = 6.65, μ̂_13 = 6.51
μ̂_21 = 6.44, μ̂_22 = 7.92, μ̂_23 = 7.24
μ̂_31 = 7.01, μ̂_32 = 8.15, μ̂_33 = 7.77

each with standard error se = sqrt(MSE/n) = sqrt(0.2438/4) = 0.2469.

2) Propose an effect model with the interaction effect included; compute the estimates of the parameters.

The effect model is

y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk,  i = 1,2,3, j = 1,2,3, k = 1,...,4,

with assumptions Σ_i α_i = 0, Σ_j β_j = 0, Σ_i (αβ)_ij = Σ_j (αβ)_ij = 0, and ε_ijk ~ N(0, σ²). The error variance estimate is σ̂² = 0.2483.

μ̂ = 6.98
α̂_1 = 6.09 - 6.98 = -0.89,  α̂_2 = 7.2 - 6.98 = 0.22,  α̂_3 = 7.64 - 6.98 = 0.66
β̂_1 = 6.19 - 6.98 = -0.79,  β̂_2 = 7.57 - 6.98 = 0.59,  β̂_3 = 7.17 - 6.98 = 0.19

and the estimated interaction effects are

(αβ)_11 = 5.11 - 6.09 - 6.19 + 6.98 = -0.19    (αβ)_12 = 6.65 - 6.09 - 7.57 + 6.98 = -0.03    (αβ)_13 = 6.51 - 6.09 - 7.17 + 6.98 = 0.23
(αβ)_21 = 6.44 - 7.2 - 6.19 + 6.98 = 0.03      (αβ)_22 = 7.92 - 7.2 - 7.57 + 6.98 = 0.13      (αβ)_23 = 7.24 - 7.2 - 7.17 + 6.98 = -0.15
(αβ)_31 = 7.01 - 7.64 - 6.19 + 6.98 = 0.16     (αβ)_32 = 8.15 - 7.64 - 7.57 + 6.98 = -0.08    (αβ)_33 = 7.77 - 7.64 - 7.17 + 6.98 = -0.06

3) Compute standard errors of the parameter estimates for the main effects in part 2).

α̂_1 = ȳ_1.. - ȳ... = ȳ_1.. - (ȳ_1.. + ȳ_2.. + ȳ_3..)/3 = (2/3)ȳ_1.. - (1/3)ȳ_2.. - (1/3)ȳ_3..,

so

Var(α̂_1) = (4/9 + 1/9 + 1/9) Var(ȳ_1..) = (2/3) · σ²/(3·4) = σ²/18.

Since a = b = 3, each main-effect estimate has variance σ²/18, estimated by MSE/18 = 0.0135; the standard error of each main-effect estimate is therefore sqrt(0.0135) ≈ 0.116.
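The quantities in parts 2), 3), and 5) below can be reproduced directly from the table of cell averages and the stated MSE; a minimal R sketch (the object names are illustrative only):

ybar <- matrix(c(5.11, 6.65, 6.51,
                 6.44, 7.92, 7.24,
                 7.01, 8.15, 7.77), nrow = 3, byrow = TRUE)   # rows = pens, cols = treatments
n <- 4                                        # replicates per cell
mu.hat    <- mean(ybar)                       # grand mean, 6.98
alpha.hat <- rowMeans(ybar) - mu.hat          # pen (row) main effects
beta.hat  <- colMeans(ybar) - mu.hat          # treatment (column) main effects
ab.hat    <- ybar - outer(rowMeans(ybar), colMeans(ybar), "+") + mu.hat   # interactions
MSE <- 0.2438
sqrt(MSE/n)               # se of each cell-mean estimate, 0.2469
sqrt(MSE/18)              # se of each main-effect estimate
1 - pf(0.664, 4, 27)      # p-value for the interaction F test in part 5)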

4) Complete the following ANOVA table.

Source        DF     SS        MS         F value
Pen            2    6.0677    3.03385    12.22
Treatment      2    1.4054    0.7027      2.83
Interaction    4    0.6595    0.164875    0.664
Error         27    6.7041    0.2483
Total         35   14.8367

SS_Pen = bn Σ_i (ȳ_i.. - ȳ...)² = 3 · 4 · 1.2761 = 6.0677
SS_Treatment = an Σ_j (ȳ_.j. - ȳ...)² = 3 · 4 · 1.0083 = 1.4054

5) (3pts) Compute R-square and test significance of the interaction effect.

R-square = SS_Model/SS_Total = 8.1326/14.8367 = 0.5481.
P-value for the interaction = P{F(4, 27) > 0.664} > 0.05, so the interaction is not significant at the 0.05 level.

2. An engineer suspects that the surface finish of a metal part is influenced by the feed rate and the depth of cut. She selects three feed rates and three depths of cut. However, only 9 runs can be made in one day. She runs a complete replicate of the design on each day; the combination of the levels of feed rate and depth of cut is chosen randomly. The data are shown in the following table (dataset Surface.csv is provided). Assume that the days are blocks.

                    Day 1                  Day 2
                    Depth                  Depth
Feed rate     0.15   0.18   0.20     0.15   0.18   0.20
0.2             74     79     82       64     68     88
0.25            92     98     99       86    104    108
0.3             99    104    108       98     99    110

(a) (3 pts) What design is this?

A blocked factorial design (a 3×3 factorial run in complete blocks, with days as the blocks).

(b) (6 pts) State the statistical model and the corresponding assumptions.

y_ijk = μ + τ_i + β_j + (τβ)_ij + δ_k + ε_ijk,  i = 1,2,3, j = 1,2,3, k = 1,2,

where the day effect δ_k is a block factor, with Σ_i τ_i = 0, Σ_j β_j = 0, Σ_i (τβ)_ij = Σ_j (τβ)_ij = 0, Σ_k δ_k = 0, and ε_ijk ~ N(0, σ²).

(c) Make conclusions at false positive rate 0.05 and check model adequacy.

The ANOVA table is shown below. Both the Depth and Feed factors are significant at alpha = 0.05. There is no unusual pattern detected in the QQ normality plot or the residual plot.

Source        DF   Type III SS   Mean Square   F Value   Pr > F
Day            1      5.555556      5.555556      0.21   0.6610
Depth          2    560.777778    280.388889     10.46   0.0059
Feed           2   2497.444444   1248.722222     46.58   <.0001
Depth*Feed     4     68.888889     17.222222      0.64   0.6473

(d) What is the difference in randomization for experimental runs between this design and the one in Question 3?

In this design: within a day, the combinations of the levels of feed rate and depth of cut are chosen randomly.

In the design of Q3: within a day, a particular mix is randomly selected and then it is applied to a panel by the three application methods.

(e) Attach SAS/R code.

data surf;
input Day Depth Feed Surface @@;
datalines;
1 0.15 0.2 74    1 0.18 0.2 79     1 0.2 0.2 82
1 0.15 0.25 92   1 0.18 0.25 98    1 0.2 0.25 99
1 0.15 0.3 99    1 0.18 0.3 104    1 0.2 0.3 108
2 0.15 0.2 64    2 0.18 0.2 68     2 0.2 0.2 88
2 0.15 0.25 86   2 0.18 0.25 104   2 0.2 0.25 108
2 0.15 0.3 98    2 0.18 0.3 99     2 0.2 0.3 110
;
proc glm data=surf;
  class Day Depth Feed;
  model Surface = Day Depth Feed Depth*Feed;
  output out=myresult r=res p=pred;
run;
proc univariate data=myresult normal;
  var res;
  qqplot res / normal(mu=0 SIGMA=EST COLOR=RED L=1);
run;
proc sgplot;
  scatter x=pred y=res;
  refline 0;
run;
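An equivalent R sketch (assuming Surface.csv contains the columns Day, Depth, Feed, and Surface entered above); since the design is balanced, the sequential sums of squares from anova() agree with the Type III values from PROC GLM:

surf <- read.csv("Surface.csv")
surf[c("Day", "Depth", "Feed")] <- lapply(surf[c("Day", "Depth", "Feed")], factor)
surf.lm <- lm(Surface ~ Day + Depth*Feed, data = surf)
anova(surf.lm)                                         # block, main effects, interaction
qqnorm(resid(surf.lm)); qqline(resid(surf.lm))         # normality check
plot(fitted(surf.lm), resid(surf.lm)); abline(h = 0)   # residuals vs. fitted values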

3. An experiment is designed to study pigment dispersion in paint. Four different mixes of a particular pigment are studied. The procedure consists of preparing a particular mix and then applying that mix to a panel by three application methods (brushing, spraying, and rolling). The response measured is the percentage reflectance of the pigment. Three days are required to run the experiment. The data follow (dataset pigment.csv is provided). Assume that mixes and application methods are fixed.

(a) (3 pts) What design is this?

Split-plot design.

(b) (6 pts) State the statistical model and the corresponding assumptions.

y_ijk = μ + γ_i + α_j + (γα)_ij + β_k + (αβ)_jk + ε_ijk,  i = 1,...,3, j = 1,...,4, k = 1,...,3,

where γ_i is the (random) day effect, α_j the mix effect, (γα)_ij the whole-plot error, and β_k the application-method effect, with γ_i ~ N(0, σ_γ²), (γα)_ij ~ N(0, σ_γα²), Σ_j α_j = 0, Σ_k β_k = 0, Σ_j (αβ)_jk = Σ_k (αβ)_jk = 0, and ε_ijk ~ N(0, σ²).

(c) Make conclusions and check model adequacy.

The ANOVA table is shown below. Both the Mix and Method factors are significant at alpha = 0.05, and their interaction is marginally significant (p-value = 0.0678). There is no unusual pattern noticed in the QQ normality plot or the residual plot.

Type 3 Tests of Fixed Effects
Effect        Num DF   Den DF   F Value   Pr > F
Mix                3        6    135.77   <.0001
Method             2       16    165.30   <.0001
Mix*Method         6       16      2.49   0.0678
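A corresponding R sketch of this split-plot analysis, using the classical Error() strata (this assumes pigment.csv contains the columns Mix, Method, Day, and Resp used in the SAS code of part (e)):

pigment <- read.csv("pigment.csv")
pigment[c("Mix", "Method", "Day")] <- lapply(pigment[c("Mix", "Method", "Day")], factor)
# days are blocks; the Day x Mix combinations are the whole plots, methods the subplots
pigment.aov <- aov(Resp ~ Mix*Method + Error(Day/Mix), data = pigment)
summary(pigment.aov)   # Mix is tested against the Day:Mix (whole-plot) stratum;
                       # Method and Mix:Method against the within (subplot) stratum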

(d) Now assume the application methods are random while the other terms are kept the same as before. State the statistical model and the corresponding assumptions using the unrestricted method; reanalyze the data.

y_ijk = μ + γ_i + α_j + (γα)_ij + β_k + (αβ)_jk + ε_ijk,  i = 1,...,3, j = 1,...,4, k = 1,...,3,

with γ_i ~ N(0, σ_γ²), Σ_j α_j = 0, (γα)_ij ~ N(0, σ_γα²), β_k ~ N(0, σ_β²), (αβ)_jk ~ N(0, σ_αβ²), and ε_ijk ~ N(0, σ²).

The ANOVA analysis shows that only the fixed effect Mix is significant.

Type 3 Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
Mix           3        6     58.37   <.0001

The covariance component estimates for all random effects are shown below; none of the random-effect variance components is significant at alpha = 0.05. From the QQ plot and residual plot, all assumptions are met.

Covariance Parameter Estimates
Cov Parm      Estimate   Standard Error   Z Value   Pr > Z    Alpha   Lower      Upper
Day            0.02216          0.09250      0.24   0.4053     0.05   0.001987   1.767E25
Mix*Day        0.02770           0.1655      0.17   0.4335     0.05   0.002525   1.934E54
Method          9.1146           9.2543      0.98   0.1623     0.05     2.4387     396.94
Mix*Method      0.3336           0.3315      1.01   0.1571     0.05    0.09094    12.6598
Residual        0.6718           0.2375      2.83   0.0023     0.05     0.3726     1.5561

And the residual plots (figures omitted here).

(e) Attach SAS/R code.

data pigment;
input Mix Method Day Resp @@;
datalines;
1 1 1 64.5   1 2 1 68.3   1 3 1 70.3   1 1 2 65.2   1 2 2 69.2   1 3 2 71.2
1 1 3 66.2   1 2 3 69     1 3 3 70.8   2 1 1 66.3   2 2 1 69.5   2 3 1 73.1
2 1 2 65     2 2 2 70.3   2 3 2 72.8   2 1 3 66.5   2 2 3 69     2 3 3 74.2
3 1 1 74.1   3 2 1 73.8   3 3 1 78     3 1 2 73.8   3 2 2 74.5   3 3 2 79.1
3 1 3 72.3   3 2 3 75.4   3 3 3 80.1   4 1 1 66.5   4 2 1 70     4 3 1 72.3
4 1 2 64.8   4 2 2 68.3   4 3 2 71.5   4 1 3 67.7   4 2 3 68.6   4 3 3 72.4
;

/* proc mixed => stat model in (b): only 5 terms included
   (the remaining terms are pooled into the random error term) */
proc mixed data=pigment method=type1;
  class Mix Method Day;
  model Resp = Mix Method Mix*Method / outp=predicted;
  random Day Day*Mix;
run;
proc univariate data=predicted normal;
  var Resid;
  qqplot Resid / normal(mu=0 SIGMA=EST COLOR=RED L=1);
run;
proc sgplot;
  scatter x=pred y=resid;
  refline 0;
run;

/* part d) */
proc mixed data=pigment CL covtest;
  class Mix Method Day;
  model Resp = Mix / outp=predicted;
  random Day Day*Mix Method Mix*Method;
run;
proc univariate data=predicted normal;
  var Resid;
  qqplot Resid / normal(mu=0 SIGMA=EST COLOR=RED L=1);
run;
proc sgplot;
  scatter x=pred y=resid;
  refline 0;
run;
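The part (d) model, with Method random, could similarly be sketched in R with lme4 (again assuming the pigment.csv layout above; lmer fits by REML, as does the PROC MIXED call for part (d)):

library(lme4)
pigment <- read.csv("pigment.csv")
pigment[c("Mix", "Method", "Day")] <- lapply(pigment[c("Mix", "Method", "Day")], factor)
# Mix fixed; Day, Day:Mix (whole-plot error), Method, and Mix:Method random
fit.d <- lmer(Resp ~ Mix + (1 | Day) + (1 | Day:Mix) + (1 | Method) + (1 | Mix:Method),
              data = pigment)
summary(fit.d)     # variance component estimates for the random terms
anova(fit.d)       # F statistic for the fixed Mix effect
plot(fit.d)        # residuals vs. fitted values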

4. An intriguing use of loess smoothing for enhancing residual diagnostics employs the method to verify, or perhaps call into question, indications of variance heterogeneity in a residual plot. From a regression fit (of any sort: SLR, MLR, loess, etc.) find the absolute residuals |e_i|, i = 1,...,n. To these, apply a loess smooth against the fitted values Ŷ_i. If the loess curve for the |e_i| exhibits departure from a horizontal line, variance heterogeneity is indicated/validated. If the smooth appears relatively flat, however, the loess diagnostic suggests that variation is not necessarily heterogeneous.

Apply this strategy to the following data: Y = {career batting average} (a number between 0 and 1, reported to three-digit accuracy) recorded as a function of X = {number of years played} for n = 322 professional baseball players. (The data are found in the file baseball.csv.) Plot the absolute residuals from a regression fit and overlay the loess smooth to determine whether or not the loess smooth suggests possible heterogeneous variation. Use a second-order, robust smooth. Explore the loess fit by varying the smoothing parameter over selected values in the range 0.25 ≤ q ≤ 0.75.

A1. Always plot the data. Sample R code:

baseball.df = read.csv( file.choose() )
attach( baseball.df )
Y = batting.average
X = years
plot( Y ~ X, pch=19 )

The plot indicates an increase in Y = batting average as X = years increases, so consider a simple linear regression (SLR) fit. For the loess smooth, first find the absolute residuals and fitted values from the SLR fit

absresid = abs( resid(lm( Y~X )) )
Yhat = fitted( lm( Y~X ) )

then apply loess (use a second-order, robust smooth, to allow for full flexibility). Try the default smoothing parameter of q = 0.75:

baseball75.lo = loess( absresid~Yhat, span=0.75, degree=2, family='symmetric' )
Ysmooth75 = predict( baseball75.lo, data.frame(Yhat=seq(min(Yhat),max(Yhat),.001)) )

Plot the |e_i| against Ŷ_i and overlay the smooth:

plot( absresid~Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )
plot( Ysmooth75~seq(min(Yhat),max(Yhat),.001), type='l', lwd=2, xaxt='n', yaxt='n',
      xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11) )

For comparison, the second-order, robust smooth at q = 0.33 gives:

baseball33.lo = loess( absresid~Yhat, span=0.33, degree=2, family='symmetric' )
Ysmooth33 = predict( baseball33.lo, data.frame(Yhat=seq(min(Yhat),max(Yhat),.001)) )
plot( absresid~Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )
plot( Ysmooth33~seq(min(Yhat),max(Yhat),.001), type='l', lwd=2, xaxt='n', yaxt='n',
      xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11) )

which gives a more jagged smoothed curve (as would be expected). Also, the second-order, robust smooth at q = 0.50 yields:

baseball50.lo = loess( absresid~Yhat, span=0.50, degree=2, family='symmetric' )
Ysmooth50 = predict( baseball50.lo, data.frame(Yhat=seq(min(Yhat),max(Yhat),.001)) )
plot( absresid~Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )
plot( Ysmooth50~seq(min(Yhat),max(Yhat),.001), type='l', lwd=2, xaxt='n', yaxt='n',
      xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11) )

which appears less jagged (again, as would be expected) and more similar to the loess curve at q = 0.75. From a broader perspective, all the smoothed loess curves suggest a fairly flat relationship, so the issue of variance heterogeneity may not be critical. (Further investigation would be warranted.)

The use of loess in this fashion is from Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368), 829-836. The data are from Sec. 3.8 of Friendly, M. (2000). Visualizing Categorical Data. Cary, NC: SAS Institute, Inc.

5. Suppose you fit a simple linear regression model to data Y_i ~ indep. N(α + βx_i, σ²), i = 1,...,n.

(a) Let β̂ be the usual least squares (LS) estimator of β. State the distribution of β̂.

(b) Let S² be the MSE = Σ_{i=1}^n (Y_i - Ŷ_i)²/(n - 2), where Ŷ_i = α̂ + β̂x_i and α̂ is the LS estimator for α. Recall that S² is known to be an unbiased estimator for σ². State a result involving the χ² distribution that involves S² and σ². What one important statistical relation (in terms of probability features) exists between this and the result you state in part (a)?

(c) Find an unbiased estimator for the ratio β/σ.

A2: For simplicity, let d = n - 2 and v = 1/Σ_{i=1}^n (x_i - x̄)².

(a) β̂ ~ N(β, σ²v). Notice that Z = (β̂ - β)/(σ√v) ~ N(0, 1).

(b) W = S²d/σ² ~ χ²(d), where d = n - 2. This is statistically independent of β̂ (and Z) in part (a).

(c) Given Z = (β̂ - β)/(σ√v) ~ N(0, 1), independent of W = S²d/σ² ~ χ²(d), where d = n - 2. Note that since the χ²(d) density integrates to 1, i.e.,

∫_0^∞ [Γ(d/2) 2^(d/2)]^(-1) w^(d/2 - 1) e^(-w/2) dw = 1,

we know that ∫_0^∞ w^(a-1) e^(-w/2) dw = 2^a Γ(a) for a > 0 (take a = d/2 above). Thus, e.g.,

E[W^(-b)] = ∫_0^∞ [Γ(d/2) 2^(d/2)]^(-1) w^(d/2 - b - 1) e^(-w/2) dw = Γ(d/2 - b) 2^(d/2 - b) / [Γ(d/2) 2^(d/2)] = Γ(d/2 - b) / [2^b Γ(d/2)]

for 0 < b < d/2. In particular,

E[W^(-1/2)] = E[σ/(S√d)] = Γ((d-1)/2) / [Γ(d/2) √2].

Now, since Z and W are independent,

T = Z/√(W/d) = [(β̂ - β)/(σ√v)] · (σ/S) = (β̂ - β)/(S√v) ~ t(d),

so for d > 1, E[T] = 0. This can be written as E[(β̂ - β)/(S√v)] = 0. That is,

E[β̂/(S√v)] = E[β/(S√v)] = [β/(σ√v)] E[σ/S] = [β/(σ√v)] √(d/2) Γ((d-1)/2)/Γ(d/2).

Now multiply both sides by √v to find

E[β̂/S] = (β/σ) √(d/2) Γ((d-1)/2)/Γ(d/2).

Therefore, an unbiased estimator for β/σ is (β̂/S) √(2/d) Γ(d/2)/Γ((d-1)/2), where d = n - 2.
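The unbiasedness of this estimator is easy to check by simulation; a brief R sketch, using arbitrary illustrative values of α, β, and σ:

set.seed(1)
n <- 15; d <- n - 2
alpha <- 2; beta <- 0.5; sigma <- 1.5             # arbitrary illustrative true values
x <- seq(1, 10, length.out = n)
est <- replicate(20000, {
  y   <- alpha + beta*x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  (coef(fit)[2]/summary(fit)$sigma) * sqrt(2/d) * gamma(d/2)/gamma((d - 1)/2)
})
mean(est)          # close to the true ratio
beta/sigma         # = 0.333...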

6. In the study of weathering in mountainous ecosystems, it is expected that silicon weathers away in soil as ambient temperature increases. In an experiment to study this, data were recorded on loss of silicon in soil at four independent sites over differing temperature conditions. These were:

Temp. (°C+5)               3       5      10
Silicon conc. (mg/kg)  132.1    73.2     3.5
                       136.7    63.1    17.7
                       146.8    52.8     9.2
                       126.1    51.7     1.9

Assume that the observations satisfy Y_ij ~ indep. N(μ{t_i}, σ²), i = 1,...,3, j = 1,...,4, where the t_i are the 3 temperatures under study and μ{t_i} is some function of t_i. Using linear regression methods, find a model that fits these data both as reasonably and as parsimoniously as possible. (This question is purposefully open-ended.) From your fit, perform a test to assess the hypotheses of no effect, vs. some effect, due to temperature change in these sites. Set your false positive error rate to 0.05.

A3. Always plot the data. Sample R code:

silicon.df = read.csv( file.choose() )
attach( silicon.df )
Y = conc
t = temp
plot( Y ~ t, pch=19 )

The plot indicates a decrease in Y = silicon concentration as X = temperature increases, so consider a simple linear regression (SLR) fit and (first) check the residual plot:

siliconslr.lm = lm( Y ~ t )
plot( resid(siliconslr.lm)~t, pch=19 ); abline( h=0 )

The residual plot indicates a clear pattern (also evident from a close look at the scatterplot), so an SLR model gives a poor fit. With the observed pattern in the residuals, the obvious thing to try next is a quadratic model:

siliconqr.lm = lm( Y ~ t + I(t^2) )
plot( resid(siliconqr.lm)~t, pch=19 ); abline( h=0 )

The residual plot here indicates a better fit, with possibly a slight decrease in variation at higher temperatures (i.e., slightly heterogeneous variance). But first, overlay the fitted model on the original data:

bqr = coef( siliconqr.lm )
plot( Y ~ t, pch=19, xlim=c(3,10), ylim=c(0,150) ); par( new=TRUE )
curve( bqr[1] + x*(bqr[2] + x*bqr[3]), xlim=c(3,10), ylim=c(0,150), ylab='', xlab='' )

The ever-present danger with a quadratic fit is evident here: the very good fit also comes with the unlikely implication that the mean response turns back up before we reach the highest observed temperature. (Quick reflection suggests that this is hard to explain: it is reasonable for the soil to lose silicon as temperature rises, but then how could it regain the silicon as the temperature starts to rise even higher?)

So, start again: since the simple linear model fails to account for curvilinearity in the data, try a transformation. The logarithm is a natural choice:

U = log(Y); plot( U ~ t, pch=19 )
siliconlog.lm = lm( U~t )
plot( resid(siliconlog.lm)~t, pch=19 ); abline( h=0 )

Some improvement is shown in the residuals, but the curvilinearity may still be present and there is now clear variance heterogeneity. So, try a quadratic linear predictor again, but now against U = log(Y), and also apply weighted least squares (WLS) to account for the heterogeneous variances. For the WLS fit, the per-temperature replication makes choice of the weights easy: use reciprocals of the sample variances at each temperature.

s2 = by( data=silicon.df$conc, INDICES=factor(silicon.df$temp), FUN=var )
w = rep( 1/s2, each=4 )
siliconqlog.lm = lm( U~t+I(t^2), weights=w )
plot( resid(siliconqlog.lm)~t, pch=19 ); abline( h=0 )

We don't see much change in the residual plot (of course, we don't expect to: the theory tells us that the inverse-variance weighting will nonetheless adjust for any variance heterogeneity). An overlay of the (back-transformed) fitted model on the data shows a much more sensible picture:

bqlog = coef( siliconqlog.lm )
plot( Y ~ t, pch=19, xlim=c(3,10), ylim=c(0,150) ); par( new=TRUE )
curve( exp(bqlog[1] + x*(bqlog[2] + x*bqlog[3])), xlim=c(3,10), ylim=c(0,150), ylab='', xlab='' )

So, proceed with this model, where E[log(Y_i)] = β_0 + β_1 t_i + β_2 t_i². The hypothesis of no effect due to temperature is H_0: β_1 = β_2 = 0. (The alternative H_a is any difference.) Test this via (output edited):

summary( siliconqlog.lm )

Call:
lm(formula = U ~ t + I(t^2), weights = w)

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   6.010783    1.762351    3.411   0.00774
t            -0.342931    0.654799   -0.524   0.61312
I(t^2)       -0.008346    0.048662   -0.172   0.86761

Residual standard error: 0.08094 on 9 degrees of freedom
Multiple R-squared: 0.8554,  Adjusted R-squared: 0.8232
F-statistic: 26.61 on 2 and 9 DF,  p-value: 0.0001664

The pertinent test statistic here is the full F-statistic of F_calc = 26.61 with (2, 9) d.f. (given at the bottom of the output). The corresponding P-value is P = 1.7 × 10^-4, which is well below 0.05. Thus we reject H_0 and conclude that, under this model, there is a significant effect due to temperature on (log-)silicon concentration.

Notice, by the way, that besides β_0, neither individual regression parameter is significant based on its 1 d.f. partial t-test. This is due (not surprisingly) to the heavy multicollinearity underlying this quadratic regression; the VIFs are both far above 10.0:

require( car )
vif( siliconqlog.lm )

       t   I(t^2)
 110.312  110.312

Indeed, we should formally center the temperature variable before conducting the quadratic fit. Notice, however, that the full F-statistic (and hence its P-value) does not change (output edited):

tminustbar = scale( t, scale=F )
summary( lm(U~tminustbar+I(tminustbar^2), weights=w) )

Residual standard error: 0.08094 on 9 degrees of freedom
Multiple R-squared: 0.8554,  Adjusted R-squared: 0.8232
F-statistic: 26.61 on 2 and 9 DF,  p-value: 0.0001664