Solutions - Homework #2


1. Problem 1: Biological Recovery

(a) A scatterplot of the biological recovery percentages versus time is given to the right. In viewing this plot, there is a negative, slightly nonlinear relationship between recovery percentage and time. There also appears to be more variability in the recoveries at early times, in that the relationship is most clearly defined at greater times. Hence, there appears to be some variance heterogeneity present. [Figure: Scatterplot of Recovery Percentage vs. Time]

(b) A scatterplot of the log biological recovery percentages versus time is given to the right. In viewing this plot, the relationship is still negative as before, but is now fairly linear. Additionally, the amount of variability about this linear pattern appears to be homogeneous over all times, indicating variance homogeneity for this relationship. Hence, a simple linear regression model on this log scale seems more appropriate than fitting a nonlinear model on the original scale, due to the variance heterogeneity on the raw scale. [Figure: Scatterplot of Log Recovery vs. Time]

(c) The simple linear regression model y_i = β0 + β1 x_i + ε_i (with y = log biological recovery and x = time) was fit to these data, producing the following parameter estimates for the intercept and slope: β̂0 ≈ 3.97 and β̂1 = −0.0368. The R²-statistic as reported from the MatLab output was R² = 0.9287, meaning that 92.9% of the variation in log recovery percentages was explained by the model on time. The relevant output from the tstat regression output structure is shown below, and all code used in this problem is given at the end of these solutions.

Coefficients:
                Value      Std. Error    p-value
    Intercept    3.97        0.1088      < .0005
    time        -0.0368      0.0031      < .0005

Multiple R-Squared: 0.9287
F-statistic on 1 and 11 degrees of freedom, p-value on the order of 1e-07

(d) An estimate of σ² is given by the mean square error (MSE), which is reported in MatLab as out.mse in the out regression structure. Hence, MSE = 0.0431. The standard errors of β̂0 and β̂1 are given in the Coefficients table above as SE(β̂0) = 0.1088 and SE(β̂1) = 0.0031. Using these standard errors and computing the t-critical value with n − p = 13 − 2 = 11 degrees

of freedom at the 99% level (t.995(11) = 3.1058), individual 99% confidence intervals for β0 and β1 were computed via MatLab as:

For β0: β̂0 ± t SE(β̂0) = 3.97 ± 3.1058(0.1088) = 3.97 ± 0.3379 → (3.63, 4.30).
For β1: β̂1 ± t SE(β̂1) = −0.0368 ± 3.1058(0.0031) = −0.0368 ± 0.0096 → (−0.0464, −0.0273).

Hence, we are 99% confident that the mean log recovery percentage at time 0 (the intercept) is between 3.63 and 4.30 log(%). We are also (individually) 99% confident that the slope of the regression of log recovery percentage on time is between −0.0464 and −0.0273 log(%) per minute. Since neither confidence interval contains the value 0, both β0 and β1 appear to be significant (significantly different from 0) in this model.

(e) The confidence bands were computed using the confregplot function from the course webpage, as shown in the scatterplot to the right below with the fitted regression line. The prediction bands are also plotted. To get at the gains in precision in estimating the mean of y for different values of x, the table to the left below gives the margins of error in the confidence interval for E(y) for several values of x. As expected, the further we get from the mean x-value of 30, the greater the variability in estimating E(y). To be more precise, the margin of error at either extreme of the times (0 or 60) is roughly 41% larger (0.29 to 0.41) than that at the middle time (30 minutes). [Table of x vs. margin of error; Figure: fitted line with confidence and prediction bands, log recovery percentage vs. time (in minutes)]

2. Problem 2: Logistic reparameterization: Suppose we begin with the logistic growth model parameterized as:

    u(t) = M u / [u + (M − u) exp(−kt)]

where (M, u, k) are the model parameters. First, divide the numerator and denominator by u. Doing so gives:

    u(t) = M / [1 + (M/u − 1) exp(−kt)]
         = M / (1 + exp[log(M/u − 1) − kt])

(since M/u − 1 = exp[log(M/u − 1)]). Defining a = log(M/u − 1) and b = −k, this can be written u(t) = M / [1 + exp(a + bt)], as desired.
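The reparameterization in Problem 2 can be verified numerically: both forms should agree at every t. A minimal sketch in Python, with the parameter values (M, u, k) chosen arbitrarily for the check:

```python
import math

# Logistic growth model, original parameterization (M, u, k):
# u(t) = M*u / (u + (M - u)*exp(-k*t))
def logistic_orig(t, M, u, k):
    return M * u / (u + (M - u) * math.exp(-k * t))

# Reparameterized form with a = log(M/u - 1), b = -k:
# u(t) = M / (1 + exp(a + b*t))
def logistic_reparam(t, M, a, b):
    return M / (1 + math.exp(a + b * t))

# Hypothetical parameter values, used only for this check
M, u, k = 11000.0, 500.0, 0.6
a, b = math.log(M / u - 1), -k

for t in [0.0, 1.0, 5.0, 16.0]:
    assert abs(logistic_orig(t, M, u, k) - logistic_reparam(t, M, a, b)) < 1e-6
```

Note that at t = 0 the original form reduces to u, confirming that u plays the role of the initial value.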

3. Problem 3: Weibull fit (problem 14)

(a) A scatterplot of the growth amounts vs. time is given to the right. [Figure: Growth (orgs./ml) vs. Time (in days)] In viewing this plot, the growth amounts initially increase very rapidly but seem to level off around 10,000 orgs./0.5 ml. It is also worth noting that there seems to be more variability in these growth amounts as their values increase. The resulting pattern may be well described by an exponential growth curve, although we may require the additional flexibility provided by the Weibull model.

(b) Using the Weibull growth model parameterized as y_i = α{1 − exp[−(t_i/σ)^γ]} + ε_i, we first recognize that α is the upper asymptote of the curve, since as time increases, the exponential piece goes to zero. Eyeballing where this limit occurs from the scatterplot, we choose α = 11,000 as the starting value for α in a nonlinear least squares fit. Picking two points from the scatterplot, we can see that (x, y) = (2, 2480) and (6, 9440) roughly fit the curved pattern seen. Substituting these values into the Weibull model gives the following pair of equations:

    2480 = 11000{1 − exp[−(2/σ)^γ]}   and   9440 = 11000{1 − exp[−(6/σ)^γ]}.

Solving this system of two equations in the two unknowns (σ, γ):

    8520 = 11000 exp[−(2/σ)^γ]   and   1560 = 11000 exp[−(6/σ)^γ]
    ⇒ ln(8520/11000) = −(2/σ)^γ   and   ln(1560/11000) = −(6/σ)^γ
    ⇒ ln(8520/11000) / ln(1560/11000) = (2/6)^γ = (1/3)^γ   (dividing the equations)
    ⇒ γ = ln[ln(8520/11000)/ln(1560/11000)] / ln(1/3) = 1.85
    ⇒ (6/σ)^γ = −ln(1560/11000)   ⇒   σ = 6 exp[−ln(−ln(1560/11000))/γ] = 6 exp[−ln(1.9533)/1.85] = 4.18.

Hence, the starting values I used to fit the Weibull model were (α, σ, γ) = (11000, 4.18, 1.85).

(c) With these starting values found above, nlinfit was used to fit a Weibull growth curve to these data, resulting in the fitted curve plotted above. The resulting parameter estimates are: α̂ = 10,369, σ̂ = 3.9876, γ̂ = 2.4840.

(d) A residual plot and normal quantile plot are shown below.
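The residuals in these plots are the observed growths minus the fitted Weibull mean. A minimal sketch of that calculation in Python, taking the parameter values to be the fitted estimates read from the solution above (an assumption, since only the curve itself is shown):

```python
import math

# Weibull growth mean function: E[y] = alpha * (1 - exp(-(t/sigma)^gamma))
def weibull_mean(t, alpha, sigma, gamma):
    return alpha * (1.0 - math.exp(-((t / sigma) ** gamma)))

# Fitted values as read from the solution (assumed readings)
alpha_hat, sigma_hat, gamma_hat = 10369.0, 3.9876, 2.4840

# Predicted values over the observed time range; a residual would be
# observed growth minus the corresponding fitted value
fitted = [weibull_mean(t, alpha_hat, sigma_hat, gamma_hat) for t in range(0, 17)]
```

The curve starts at 0, rises monotonically, and approaches the asymptote alpha_hat for large t, matching the description in parts (a) and (b).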
In viewing these plots, the residual plot exhibits a potential pattern, with most of the data appearing at the largest predicted values. However, this is a byproduct of having roughly 10 of the values near the asymptote of 10,000. The points at this largest predicted value do have more variability, possibly indicating variance heterogeneity in the larger growth values. The normal quantile plot shows a reasonably linear relationship between the residuals and the standard normal quantiles, indicating no serious departures from normality for the residuals.
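Before moving to the interval estimates, the two-equation solve for the starting values in part (b) can be double-checked numerically. A short sketch in Python, assuming the anchor values used in the solution (asymptote guess 11000, with eyeballed points (2, 2480) and (6, 9440)):

```python
import math

# Anchor points read off the scatterplot in part (b) (assumed values)
alpha0 = 11000.0
t1, y1 = 2.0, 2480.0
t2, y2 = 6.0, 9440.0

# From y = alpha0 * (1 - exp(-(t/sigma)^gamma)):
#   (t/sigma)^gamma = -log(1 - y/alpha0)
A1 = -math.log(1.0 - y1 / alpha0)   # = (t1/sigma)^gamma
A2 = -math.log(1.0 - y2 / alpha0)   # = (t2/sigma)^gamma

# Dividing the two equations: (t1/t2)^gamma = A1/A2
gamma0 = math.log(A1 / A2) / math.log(t1 / t2)

# Back-substituting: sigma = t2 / A2^(1/gamma)
sigma0 = t2 / A2 ** (1.0 / gamma0)

print(round(gamma0, 2), round(sigma0, 2))   # → 1.85 4.18
```

This reproduces the starting values (σ, γ) = (4.18, 1.85) derived by hand in the solution.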

[Figures: Residual Plot (residuals vs. predicted values) and Normal Quantile Plot (residuals vs. standard normal quantiles)]

(e) The nlparci function in MatLab was used to find individual 95% confidence intervals for the three model parameters. The resulting intervals are reported below, along with the by-hand calculations using the standard errors reported in the next part. For each confidence interval to be at the 95% level individually, since there are n = 16 data pairs and p = 3 parameters being estimated, there are n − p = 13 degrees of freedom available. Under the assumption that the sampling distributions of the parameter estimates are normal, we use a t-based confidence interval for α, σ, and γ. The critical t-value is t* = t13(.975) = 2.1604, as found via MatLab. This resulted in the following set of 3 individual 95% confidence intervals for the 3 parameters:

For α: α̂ ± t* SE(α̂) = 10369 ± 2.1604(342.5) = 10369 ± 740 = (9629, 11109),
For σ: σ̂ ± t* SE(σ̂) = 3.9876 ± 2.1604(0.2939) = 3.9876 ± 0.635 = (3.3526, 4.6226),
For γ: γ̂ ± t* SE(γ̂) = 2.4840 ± 2.1604(0.5892) = 2.4840 ± 1.273 = (1.2110, 3.7569).

The confidence interval for α can be interpreted as: we are 95% confident that the true value of α in the Weibull model relating growth amount to time is between 9629 and 11109. More precisely, we are 95% confident that the maximum growth reached is between 9629 and 11109 orgs./ml. The others can be interpreted similarly.

To have 3 confidence intervals simultaneously at the 95% level, we need to use the Bonferroni correction as discussed briefly in class. To do this, instead of finding the critical t-value at the .975 percentile of the t-distribution with 13 degrees of freedom, we divide the lower tail probability (.025) by 3 (the number of CIs desired) to get a tail probability of .025/3 ≈ .0083, and then compute t.9917(13) to get the critical value. Doing so gives t.9917(13) ≈ 2.746 and a wider set of CIs:

For α: (9428, 11310)    For σ: (3.181, 4.795)    For γ: (0.866, 4.102)

in which we are simultaneously 95% confident that the three parameters fall inside their respective intervals.
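In MatLab the two critical values come from tinv(.975,13) and tinv(1-.025/3,13). They can also be reproduced without MatLab; the sketch below inverts the t CDF numerically in plain Python (the Simpson-rule and bisection machinery is illustrative scaffolding, not part of the assignment):

```python
import math

def t_pdf(x, df):
    # Density of the t-distribution with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, n=2000):
    # Simpson's rule on [0, x] plus the 0.5 mass below zero (valid for x >= 0)
    if x == 0:
        return 0.5
    h = x / n
    s = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + s * h / 3

def t_quantile(p, df):
    # Bisection for the quantile, for p >= 0.5
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t_ind = t_quantile(0.975, 13)             # individual 95% critical value
t_bonf = t_quantile(1 - 0.025 / 3, 13)    # Bonferroni value for 3 intervals
```

As expected, the Bonferroni critical value is larger than the individual one, which is why the simultaneous intervals are wider.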
(f) As indicated in the previous part, the operative t-critical value is t* = t13(.975) = 2.1604. The lengths of the three confidence intervals, divided by 2, give the margins of error for the three intervals. Dividing these margins of error by t* gives the standard errors for each of the three parameter estimates. These calculations are summarized in the table below.

Parameter   Estimate    CI                  Margin of Error                 Standard Error
α           10369       (9629, 11109)       (11109 - 9629)/2 = 740          740/t* = 342.53
σ           3.9876      (3.3526, 4.6226)    (4.6226 - 3.3526)/2 = 0.635     0.635/t* = 0.2939
γ           2.4840      (1.2110, 3.7569)    (3.7569 - 1.2110)/2 = 1.273     1.273/t* = 0.5893

(g) Since the individual 95% confidence interval for γ clearly does not include the value γ = 1, we have evidence at the .05 level that γ differs from 1 and that this extra Weibull parameter is significant in the model. If we instead consider the Bonferroni-corrected 95% confidence intervals (which is really the correct thing to do), then since the CI for γ contains the value 1 (albeit barely), we would conclude that this parameter is unnecessary in the model, i.e., that the model is overspecified. If we omit this parameter, which we are justified in doing, we are left with the 2-parameter exponential growth model, which was fully discussed in class.

MatLab Code Used

clear all;

% ============================================================ %
% Problem 1: Plot of biological recovery percentages vs. time  %
% ============================================================ %
load ../data/cloud.mat              % Reads in the data
time = cloud.time;                  % Renames the time variable
recov = cloud.recovery;             % Renames the recovery variable

figure(1)
plot(time,recov,'ko')               % Plots recoveries vs. time
xlim([-5 65])                       % Sets x-limits on plot
xlabel('Time (in minutes)','fontsize',14)
ylabel('Biological Recovery Percentage','fontsize',14)
title('Scatterplot of Recovery Percentage vs. Time','fontsize',14,'fontweight','b')

figure(2)
logrecov = log(recov);              % Log of recoveries
plot(time,logrecov,'ko')            % Plots log recovery vs. time
xlim([-5,65])                       % Sets x-limits on plot
xlabel('Time (in minutes)','fontsize',14)
ylabel('Log Biological Recovery Percentage','fontsize',14)
title('Scatterplot of Log Recovery vs. Time','fontsize',14,'fontweight','b')

% ===================================================================== %
% Problem 1: Regression of log biological recovery percentages vs. time %
% ===================================================================== %
out = regstats(logrecov,time);      % Regresses log recovery (y) on time (x)
out.tstat                           % Requests relevant parameter estimate info
out.rsquare                         % Multiple R-Squared value
out.mse                             % Mean Squared Error (MSE)
out.fstat                           % Model F-Statistic, df, pval

n = length(time);                   % Sample size
p = 2;                              % Defines # parameters
bhat = out.beta;                    % Vector of parameter estimates

seb = sqrt(diag(out.covb));         % Vector of standard errors
tbonf = tinv(1-.005,n-p);           % 99% uncorrected t*-value
ci_b = [bhat-tbonf*seb,...          % Confidence intervals for betas
        bhat+tbonf*seb];            % in 2 columns (lower,upper)
ci_b                                % Prints confidence intervals

% ================================================= %
% Problem 1: Confidence bands for E(log recoveries) %
% ================================================= %
xlab = 'Time (in minutes)';         % X-axis label
ylab = 'Log Recovery Percentage';   % Y-axis label
% The confregplot function plots the confidence and prediction
% bands for E(y). This function requires as inputs the x & y
% variables, labels for these variables, and the confidence level.
confregplot(time,logrecov,xlab,ylab,99)

% =============================================== %
% Problem 3: Weibull fit to Growth Data with plot %
% =============================================== %
load ../data/paramecium.mat         % Loads the growth data
growth = paramecium.growth;         % Defines growth vector
time = paramecium.time;             % Defines time vector
plot(time,growth,'ko')              % Plots growth vs. time
xlim([0 16])                        % x-axis plotting limits
xlabel('Time (in days)','fontsize',14,'fontweight','b');
ylabel('Growth (orgs./ml)','fontsize',14,'fontweight','b');
title('Growth vs. Time','fontsize',14,'fontweight','b');
hold on;                            % Hold the current plot

% ======================================================= %
% Problem 3c - Nonlinear Weibull model fit to growth data %
% ======================================================= %
beta1 = [11000 4.18 1.85];          % Parameter starting values (from part (b))
[betahat1 resid1 J1] = nlinfit(time,...   % Performs nonlinear Weibull fit
    growth,@weibull,beta1);               % returning betahats, resids, Jacobian
time1 = 0:.1:16;                    % Vector of times from 0 to 16
yhat1 = betahat1(1)*(1-exp(-(time1./...   % Computes Weibull predicted
    betahat1(2)).^betahat1(3)));          % values (yhat's)
plot(time1,yhat1);                  % Plots the fitted line
hold off;                           % End hold on current plot
nlintool(time,growth,@weibull,beta1);     % Plots 95% confidence bands

% =========================================================== %
% Problem 3d - Residual and normal quantile plot of residuals %
% =========================================================== %
figure(1)                           % 1st Figure
yhat.weib = growth - resid1;        % Computes Weibull predicted values
plot(yhat.weib,resid1,'ko');        % Plots residuals vs. predicted y-values
xlabel('Predicted Values','fontsize',14,'fontweight','b');

ylabel('Residuals','fontsize',14,'fontweight','b');
title('Residual Plot','fontsize',14,'fontweight','b');

figure(2);                          % 2nd Figure
qqplot(resid1);                     % Normal quantile plot of residuals
xlabel('Standard Normal Quantiles','fontsize',14,'fontweight','b');
ylabel('Residuals','fontsize',14,'fontweight','b');
title('Normal Quantile Plot','fontsize',14,'fontweight','b');

% ======================================================================= %
% Problem 3e,f - Individual 95% Confidence intervals for the 3 parameters %
% ======================================================================= %
ci = nlparci(betahat1,resid1,J1);   % Computes CIs for (alpha,sigma,gamma)
moe = (ci(:,2)-ci(:,1))/2;          % Margin of error from CIs
se = moe/tinv(.975,13);             % Standard errors from CIs
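The last three MatLab lines, which back out margins of error and standard errors from the nlparci intervals, can be mirrored in a few lines of Python. The interval endpoints below are taken from part (e) of these solutions (treated here as given values):

```python
# Confidence intervals for (alpha, sigma, gamma), as reported in part (e)
ci = {
    "alpha": (9629.0, 11109.0),
    "sigma": (3.3526, 4.6226),
    "gamma": (1.2110, 3.7569),
}
t_crit = 2.1604                     # t_{.975}(13), as used in the solution

# Margin of error = half the interval length; SE = margin of error / t*
moe = {name: (hi - lo) / 2 for name, (lo, hi) in ci.items()}
se = {name: m / t_crit for name, m in moe.items()}

print(moe["alpha"], round(se["sigma"], 4))   # → 740.0 0.2939
```

This reproduces the Margin of Error and Standard Error columns of the table in part (f).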


More information

Unit 6 - Introduction to linear regression

Unit 6 - Introduction to linear regression Unit 6 - Introduction to linear regression Suggested reading: OpenIntro Statistics, Chapter 7 Suggested exercises: Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25,

More information

Ratio of Polynomials Fit Many Variables

Ratio of Polynomials Fit Many Variables Chapter 376 Ratio of Polynomials Fit Many Variables Introduction This program fits a model that is the ratio of two polynomials of up to fifth order. Instead of a single independent variable, these polynomials

More information

SOCY5601 Handout 8, Fall DETECTING CURVILINEARITY (continued) CONDITIONAL EFFECTS PLOTS

SOCY5601 Handout 8, Fall DETECTING CURVILINEARITY (continued) CONDITIONAL EFFECTS PLOTS SOCY5601 DETECTING CURVILINEARITY (continued) CONDITIONAL EFFECTS PLOTS More on use of X 2 terms to detect curvilinearity: As we have said, a quick way to detect curvilinearity in the relationship between

More information

Review for Final Exam Stat 205: Statistics for the Life Sciences

Review for Final Exam Stat 205: Statistics for the Life Sciences Review for Final Exam Stat 205: Statistics for the Life Sciences Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 205: Statistics for the Life Sciences 1 / 20 Overview of Final Exam

More information

Analysis of Variance

Analysis of Variance Statistical Techniques II EXST7015 Analysis of Variance 15a_ANOVA_Introduction 1 Design The simplest model for Analysis of Variance (ANOVA) is the CRD, the Completely Randomized Design This model is also

More information

Introduction to Simple Linear Regression

Introduction to Simple Linear Regression Introduction to Simple Linear Regression 1. Regression Equation A simple linear regression (also known as a bivariate regression) is a linear equation describing the relationship between an explanatory

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat).

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat). Statistics 512: Solution to Homework#11 Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat). 1. Perform the two-way ANOVA without interaction for this model. Use the results

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Multiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company

Multiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company Multiple Regression Inference for Multiple Regression and A Case Study IPS Chapters 11.1 and 11.2 2009 W.H. Freeman and Company Objectives (IPS Chapters 11.1 and 11.2) Multiple regression Data for multiple

More information

SSR = The sum of squared errors measures how much Y varies around the regression line n. It happily turns out that SSR + SSE = SSTO.

SSR = The sum of squared errors measures how much Y varies around the regression line n. It happily turns out that SSR + SSE = SSTO. Analysis of variance approach to regression If x is useless, i.e. β 1 = 0, then E(Y i ) = β 0. In this case β 0 is estimated by Ȳ. The ith deviation about this grand mean can be written: deviation about

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

13 Simple Linear Regression

13 Simple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity

More information

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Exploring Data: Distributions Look for overall pattern (shape, center, spread) and deviations (outliers). Mean (use a calculator): x = x 1 + x

More information

STAT 350 Final (new Material) Review Problems Key Spring 2016

STAT 350 Final (new Material) Review Problems Key Spring 2016 1. The editor of a statistics textbook would like to plan for the next edition. A key variable is the number of pages that will be in the final version. Text files are prepared by the authors using LaTeX,

More information

STAT 3022 Spring 2007

STAT 3022 Spring 2007 Simple Linear Regression Example These commands reproduce what we did in class. You should enter these in R and see what they do. Start by typing > set.seed(42) to reset the random number generator so

More information

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c Inference About the Slope ffl As with all estimates, ^fi1 subject to sampling var ffl Because Y jx _ Normal, the estimate ^fi1 _ Normal A linear combination of indep Normals is Normal Simple Linear Regression

More information

Stat 101 Exam 1 Important Formulas and Concepts 1

Stat 101 Exam 1 Important Formulas and Concepts 1 1 Chapter 1 1.1 Definitions Stat 101 Exam 1 Important Formulas and Concepts 1 1. Data Any collection of numbers, characters, images, or other items that provide information about something. 2. Categorical/Qualitative

More information

Topic 14: Inference in Multiple Regression

Topic 14: Inference in Multiple Regression Topic 14: Inference in Multiple Regression Outline Review multiple linear regression Inference of regression coefficients Application to book example Inference of mean Application to book example Inference

More information

Chapter 1 Linear Regression with One Predictor

Chapter 1 Linear Regression with One Predictor STAT 525 FALL 2018 Chapter 1 Linear Regression with One Predictor Professor Min Zhang Goals of Regression Analysis Serve three purposes Describes an association between X and Y In some applications, the

More information

15: Regression. Introduction

15: Regression. Introduction 15: Regression Introduction Regression Model Inference About the Slope Introduction As with correlation, regression is used to analyze the relation between two continuous (scale) variables. However, regression

More information

Math 3330: Solution to midterm Exam

Math 3330: Solution to midterm Exam Math 3330: Solution to midterm Exam Question 1: (14 marks) Suppose the regression model is y i = β 0 + β 1 x i + ε i, i = 1,, n, where ε i are iid Normal distribution N(0, σ 2 ). a. (2 marks) Compute the

More information

STATISTICS 479 Exam II (100 points)

STATISTICS 479 Exam II (100 points) Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

AMS 7 Correlation and Regression Lecture 8

AMS 7 Correlation and Regression Lecture 8 AMS 7 Correlation and Regression Lecture 8 Department of Applied Mathematics and Statistics, University of California, Santa Cruz Suumer 2014 1 / 18 Correlation pairs of continuous observations. Correlation

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

TOPIC 9 SIMPLE REGRESSION & CORRELATION

TOPIC 9 SIMPLE REGRESSION & CORRELATION TOPIC 9 SIMPLE REGRESSION & CORRELATION Basic Linear Relationships Mathematical representation: Y = a + bx X is the independent variable [the variable whose value we can choose, or the input variable].

More information

ECON 497 Midterm Spring

ECON 497 Midterm Spring ECON 497 Midterm Spring 2009 1 ECON 497: Economic Research and Forecasting Name: Spring 2009 Bellas Midterm You have three hours and twenty minutes to complete this exam. Answer all questions and explain

More information

ST 512-Practice Exam I - Osborne Directions: Answer questions as directed. For true/false questions, circle either true or false.

ST 512-Practice Exam I - Osborne Directions: Answer questions as directed. For true/false questions, circle either true or false. ST 512-Practice Exam I - Osborne Directions: Answer questions as directed. For true/false questions, circle either true or false. 1. A study was carried out to examine the relationship between the number

More information

Introduction to Linear Regression

Introduction to Linear Regression Introduction to Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Introduction to Linear Regression 1 / 46

More information

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS

CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS CHAPTER 5 FUNCTIONAL FORMS OF REGRESSION MODELS QUESTIONS 5.1. (a) In a log-log model the dependent and all explanatory variables are in the logarithmic form. (b) In the log-lin model the dependent variable

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Example: Four levels of herbicide strength in an experiment on dry weight of treated plants.

Example: Four levels of herbicide strength in an experiment on dry weight of treated plants. The idea of ANOVA Reminders: A factor is a variable that can take one of several levels used to differentiate one group from another. An experiment has a one-way, or completely randomized, design if several

More information

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1 Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maximum likelihood Consistency Confidence intervals Properties of the mean estimator Properties of the

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Interaction effects for continuous predictors in regression modeling

Interaction effects for continuous predictors in regression modeling Interaction effects for continuous predictors in regression modeling Testing for interactions The linear regression model is undoubtedly the most commonly-used statistical model, and has the advantage

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression MATH 282A Introduction to Computational Statistics University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/ eariasca/math282a.html MATH 282A University

More information

Lecture 10: F -Tests, ANOVA and R 2

Lecture 10: F -Tests, ANOVA and R 2 Lecture 10: F -Tests, ANOVA and R 2 1 ANOVA We saw that we could test the null hypothesis that β 1 0 using the statistic ( β 1 0)/ŝe. (Although I also mentioned that confidence intervals are generally

More information

Supplementary materials Quantitative assessment of ribosome drop-off in E. coli

Supplementary materials Quantitative assessment of ribosome drop-off in E. coli Supplementary materials Quantitative assessment of ribosome drop-off in E. coli Celine Sin, Davide Chiarugi, Angelo Valleriani 1 Downstream Analysis Supplementary Figure 1: Illustration of the core steps

More information

Consider fitting a model using ordinary least squares (OLS) regression:

Consider fitting a model using ordinary least squares (OLS) regression: Example 1: Mating Success of African Elephants In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful

More information

Test 3 Practice Test A. NOTE: Ignore Q10 (not covered)

Test 3 Practice Test A. NOTE: Ignore Q10 (not covered) Test 3 Practice Test A NOTE: Ignore Q10 (not covered) MA 180/418 Midterm Test 3, Version A Fall 2010 Student Name (PRINT):............................................. Student Signature:...................................................

More information