PARTICLE COUNTS MISLEAD SCIENTISTS: A JMP AND SAS INTEGRATION CASE STUDY INVESTIGATING MODELING

Kathleen Kiernan, Diane Michelson, PhD, Annie Zangi, SAS

Many researchers and statisticians deal with data that do not meet the assumptions of normality or homogeneity of variance. Research has shown that parametric tests such as regression or ANOVA can be robust to modest violations of these assumptions. Data transformations are commonly used tools for improving the normality of a distribution and equalizing variance so that the assumptions are met. There are many potential types of data transformations. Some of the more traditional and common ones for improving normality include the square root, logarithmic (including natural log), and inverse transformations. However, the Box-Cox transformation (1964) represents a family of power transformations that includes some of the traditional transformations and helps researchers find the optimal normalizing transformation for each variable. Engineers and researchers therefore find the Box-Cox transformation a good technique where normalizing data or equalizing variance is desired.

This paper provides an overview of the Box-Cox transformation and addresses specific questions relating to its use in statistical analysis. Examples of applications of the Box-Cox transformation are presented, including details on how to obtain a Box-Cox transformation using JMP/SAS Integration Technology.

Much research has been published since the original Box-Cox paper (1964), which provides a family of transformations that will optimally normalize a particular variable, eliminating the need to try different transformations by trial and error to determine the best option. Box and Cox (1964) perhaps originally envisioned this transformation as a solution for simultaneously correcting normality, linearity, and homoscedasticity. While these transformations often improve all of these aspects of a distribution or analysis, Sakia (1992) and others have noted that they do not always accomplish these challenging goals.

Many statistical procedures make two assumptions that are relevant to this topic: (a) an assumption that the variables (or, more technically correct, the errors or residuals) are normally distributed, and (b) an assumption of homoscedasticity or homogeneity of variance, meaning that the variance of the variable remains constant over the observed range of some other variable. In regression analyses this second assumption is that the variance around the regression line is constant across the entire observed range of data. In ANOVA analyses, this assumption is that the variance in one cell is not significantly different from that of other cells. Most statistical software packages provide ways to test both assumptions. Significant violation of either assumption can increase your chances of committing either a Type I or a Type II error (depending on the nature of the analysis and of the violation). Yet few researchers test these assumptions, and fewer still report correcting for violations of them. This is unfortunate, given that in most cases it is relatively simple to correct the problem, either by applying data transformations or by fitting statistical models designed for data with correlations or nonconstant variability and where the response is not necessarily normally distributed. Even when one is using analyses considered robust to violations of these assumptions, or non-parametric tests (which do not explicitly assume normally distributed error terms), attending to these issues can improve the results of the analyses (e.g., Zimmerman, 1995).

How does one test the assumptions of normality and homogeneity of variance?

The following information, taken from the UNIVARIATE chapter of the SAS Procedures Guide, explains how to interpret the results of the NORMAL option tests:

You determine whether to reject the null hypothesis by examining the probability that is associated with a test statistic. When the p-value is less than the predetermined critical value (alpha value), you reject the null hypothesis and conclude that the data did not come from the theoretical distribution.

If you want to test the normality assumptions that underlie analysis of variance methods, beware of using a statistical test for normality alone. A test's ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Since small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option, the HISTOGRAM statement, the PROBPLOT statement, and the QQPLOT statement can be very helpful.

For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test's ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20, rather than the often-used 0.05 level. Again, consulting plots and additional statistics will help you assess the severity of the deviations from normality.
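As a rough illustration of the skewness and kurtosis measures mentioned above, the following Python sketch computes simple moment-based versions for a small sample. The function name and the sample values are hypothetical, and these plain population-moment formulas are an assumption for illustration; they are not the bias-corrected statistics that PROC UNIVARIATE reports, so the numbers will differ somewhat from SAS output.

```python
import math

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (population formulas)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0           # 0 for a normal distribution
    return skewness, excess_kurtosis

# Hypothetical data: a symmetric sample and its exponentiated (right-skewed) version.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
right_skewed = [math.exp(v) for v in symmetric]

skew_symmetric, _ = skewness_and_excess_kurtosis(symmetric)
skew_right, _ = skewness_and_excess_kurtosis(right_skewed)
print("skewness, symmetric sample:", round(skew_symmetric, 3))
print("skewness, right-skewed sample:", round(skew_right, 3))
```

A skewness near zero is consistent with symmetry, while a clearly positive value points to a long right tail of the kind a log or other power transformation is meant to pull in.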
Traditional data transformations used to improve normality

This section briefly highlights some points regarding commonly used transformations.

Square root transformations are traditionally thought of as good for normalizing Poisson distributions and equalizing variance. Poisson distributions are most common with data that are counts of occurrences, such as the number of times an insurance claim is filed for a specific age group and type of car, or the famous example of the number of soldiers in the Prussian cavalry killed by horse kicks each year (Bortkiewicz, 1898).

Log transformations are actually a class of transformations rather than a single transformation, since the base of the logarithm can vary, and in many fields of science log-normal variables (i.e., variables that are normally distributed after log transformation) are relatively common. Log-normal variables seem to be more common when outcomes are influenced by many independent factors (e.g., biological outcomes).
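The variance-equalizing effect of the square root on Poisson counts can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: the helper names are hypothetical, and the variances are computed exactly from the Poisson probability mass function rather than from simulated data.

```python
import math

def poisson_pmf_table(mean, tail_sds=12):
    """Return [(k, P(X = k))] covering essentially all of the Poisson mass."""
    upper = int(mean + tail_sds * math.sqrt(mean) + 20)
    table = []
    log_p = -mean  # log P(X = 0)
    for k in range(upper + 1):
        if k > 0:
            log_p += math.log(mean) - math.log(k)  # recurrence P(k) = P(k-1) * mean / k
        table.append((k, math.exp(log_p)))
    return table

def variance_of(transform, mean):
    """Variance of transform(X) for X ~ Poisson(mean)."""
    table = poisson_pmf_table(mean)
    first = sum(p * transform(k) for k, p in table)
    second = sum(p * transform(k) ** 2 for k, p in table)
    return second - first ** 2

results = {}
for mean in (10, 50, 200):
    results[mean] = (variance_of(float, mean), variance_of(math.sqrt, mean))
    raw_var, root_var = results[mean]
    print(f"mean={mean}: Var(X)={raw_var:.2f}, Var(sqrt(X))={root_var:.3f}")
```

The raw variance grows with the mean, while the variance of the square root stays close to one quarter across the three means, which is the sense in which the transformation equalizes variance.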

The arcsine transformation has traditionally been used for proportions (which range from 0.00 to 1.00), and involves taking the arcsine of the square root of a number, with the resulting transformed data reported in radians. Because of the mathematical properties of this transformation, the variable must be transformed to the range −1.00 to 1.00. While this is a perfectly valid transformation, other modern techniques may limit the need for it. For example, rather than aggregating original binary outcome data to a proportion, researchers can use logistic regression on the original data.

Box-Cox transformation

You may have noticed that many potential transformations, including several discussed above, are all members of a class of transformations called power transformations. Power transformations are merely transformations that raise numbers to an exponent (power). For example, a square root transformation can be characterized as \(x^{1/2}\), an inverse transformation as \(x^{-1}\), and so forth. Tukey (1957) is often credited with presenting the initial idea that transformations can be thought of as a class or family of similar mathematical functions. This idea was modified by Box and Cox (1964) and is used to find potentially nonlinear transformations of a dependent variable. The Box-Cox transformation has the form

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex]
\log(y), & \lambda = 0
\end{cases}
\]

This family of transformations of the positive dependent variable y is controlled by the parameter λ. Transformations linearly related to the square root, inverse, quadratic, cubic, and so on are all special cases. The limit as λ approaches 0 is the log transformation. More generally, Box-Cox transformations of the following form can be fit:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{(y + c)^{\lambda} - 1}{\lambda g}, & \lambda \neq 0 \\[1ex]
\dfrac{\log(y + c)}{g}, & \lambda = 0
\end{cases}
\]

By default, c = 0. The parameter c can be used to rescale y so that it is strictly positive. By default, g = 1. Alternatively, g can be \(\dot{y}^{\lambda - 1}\), where \(\dot{y}\) is the geometric mean of y.
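The general form above translates directly into a few lines of code. The following Python sketch is a minimal illustration, assuming scalar inputs; the function names are hypothetical, and it is not the routine that JMP or SAS uses internally. It also includes the inverse, which is what one would use to convert results back to the original metric.

```python
import math

def box_cox(y, lam, c=0.0, g=1.0):
    """General Box-Cox transform ((y + c)**lam - 1) / (lam * g), log form at lam = 0."""
    shifted = y + c
    if shifted <= 0:
        raise ValueError("y + c must be strictly positive")
    if lam == 0:
        return math.log(shifted) / g
    return (shifted ** lam - 1.0) / (lam * g)

def inverse_box_cox(z, lam, c=0.0, g=1.0):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z * g) - c
    return (lam * g * z + 1.0) ** (1.0 / lam) - c

def geometric_mean(values):
    """Geometric mean of strictly positive values, for the optional scaling g."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(box_cox(4.0, 0.5))    # square-root member of the family: (2 - 1) / 0.5
print(box_cox(2.0, 1e-6))   # approaches log(2) as lam approaches 0
print(box_cox(2.0, 0.0))    # log(2)
```

Setting `g = geometric_mean(ys) ** (lam - 1)` gives the scaled version, which keeps the transformed values in comparable units so that error sums of squares can be compared across values of λ.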
Both JMP and SAS estimate λ, the Box-Cox transformation coefficient, using a maximum likelihood approach, and automatically evaluate the effects of a selected range of λ values. Given that λ can take on an almost infinite number of values, we can theoretically calibrate a transformation to be maximally effective in moving a variable toward normality, regardless of whether it is negatively or positively skewed. Additionally, as mentioned above, this family of transformations incorporates many traditional transformations:

λ = 1.00: no transformation needed; produces results identical to original data
λ = 0.50: square root transformation
λ = 0.33: cube root transformation
λ = 0.25: fourth root transformation
λ = 0.00: natural log transformation
λ = -0.50: reciprocal square root transformation
λ = -1.00: reciprocal (inverse) transformation
and so forth

Example 1 (Spin Rinse Dryer Optimization Experiment):

In this example, engineers were trying to determine process settings to minimize delta particles. Wafers are loaded into a spin rinse dryer and processing begins. Deionized (DI) water is sprayed on the wafers while they are spun for a certain length of time, at a certain spin speed. Wafers are then dried by spraying with low pressure nitrogen. Particle counts are measured before and after the clean on two wafers, and the average is reported. The factors being considered are: rinse time, spin speed (lrpm), drying time, and nitrogen pressure blast.

As the design was a full factorial, they were able to model all two-factor interactions and main effects, which yielded the results in Table 1.

Sorted Parameter Estimates
Term | Estimate | Std Error | t Ratio | Prob>|t|
n2press(10,60) (Prob>|t| <.0001*)
drytime(60,300)
lrpm*n2press
lrpm(2,2.903)
lrpm*drytime
drytime*n2press
rinsetime*lrpm
rinsetime*drytime
rinsetime*n2press
rinsetime(60,300)
Table 1.

Problem: These results caused the engineers to jump to invalid conclusions. Because both dry time and nitrogen pressure were significant positive effects, and the interaction was also positive, they changed the process to be centered around higher values of both time and pressure. However, the new results were not improved.

A closer look: One option in Fit Model for evaluating the fit and error structure is the Box Cox Transformation. When we select this option, we see the graph in Figure 1 below. In this graph, we see that a transformation would indeed be appropriate here. The graph is a plot of SSE versus lambda for various values of lambda. The model with the lowest SSE is judged best. The red line gives a confidence limit on the lambdas. Models with SSE below the red line are statistically indistinguishable from the best one. Lambda=1 indicates that no transformation is necessary. Lambda=0 indicates a log transformation. Lambda=0.5 indicates a square root transformation. If the SSE corresponding to Lambda=1 is above the red line, and the best is below the red line, then it will help to transform the data.
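The SSE-versus-lambda plot described above can be reproduced in outline with a short script. This Python sketch is an illustration under several assumptions: the data are simulated (not the spin rinse dryer data), the model is a one-predictor straight line rather than the factorial model, all names are hypothetical, and the cutoff uses a generic likelihood-ratio approximation that may not match how JMP draws its red line.

```python
import math
import random

def scaled_box_cox(y, lam, gm):
    """Box-Cox with g = gm**(lam - 1), so SSE is comparable across lambdas."""
    if lam == 0:
        return gm * math.log(y)
    return (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))

def sse_of_line_fit(x, z):
    """Residual sum of squares from an ordinary least squares line z ~ x."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))

# Simulated response that is linear in x only on the log scale (true lambda = 0).
rng = random.Random(0)
x = [5.0 * i / 59 for i in range(60)]
y = [math.exp(0.5 + 0.8 * xi + rng.gauss(0.0, 0.3)) for xi in x]
gm = math.exp(sum(math.log(v) for v in y) / len(y))  # geometric mean of y

lambdas = [k / 10.0 for k in range(-20, 21)]         # -2.0, -1.9, ..., 2.0
profile = {lam: sse_of_line_fit(x, [scaled_box_cox(v, lam, gm) for v in y])
           for lam in lambdas}
best_lambda = min(profile, key=profile.get)

# Approximate 95% likelihood-ratio cutoff (assumed form, chi-square(1) = 3.841).
cutoff = profile[best_lambda] * math.exp(3.841 / len(y))
plausible = [lam for lam in lambdas if profile[lam] <= cutoff]

print("best lambda:", best_lambda)
print("lambdas below the cutoff:", plausible)
print("is lambda = 1 below the cutoff?", 1.0 in plausible)
```

When lambda = 1 falls above the cutoff while the minimum sits near 0, the reading is the same as in the plot: leaving the response untransformed is not supported, and a log-like transformation is.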

[Figure 1. Box-Cox Transformations]

On selecting the option Save Best Transform, we see the recommended transform to be Y = ((delta particles)^0.2 - 1) / [...]. Rerunning the model using the new transform yields the results seen in Table 2 below.

Sorted Parameter Estimates
Term | Estimate | Std Error | t Ratio | Prob>|t|
n2press(10,60) (Prob>|t| <.0001*)
drytime(60,300) (*)
lrpm(2,2.903)
lrpm*n2press
rinsetime*lrpm
rinsetime*drytime
drytime*n2press
lrpm*drytime
rinsetime*n2press
rinsetime(60,300)
Table 2.

In the new analysis, we see that all the parameter estimates have changed. In fact, the interaction of the two significant main effects, nitrogen pressure * drying time, has changed sign.

Conclusion: The assumption that the errors are independent and identically distributed Normal (i.i.d.) is often forgotten, but still important. Failing to confirm it can lead to misleading results.

Example 2 (Amorphous Silicon Screening Experiment):

In this example, engineers were trying to determine process settings to minimize wafer variability in an amorphous silicon deposition process, both within the wafers and between wafers. The amorphous silicon deposition process consists of loading the wafers into the vertical furnace, regulating the temperature at the top zone, flowing the silicon source (in this case, silane) from top to bottom, with an optional flow at the bottom, and controlling the stabilization time.

Data were taken on 20 runs. Three wafers were measured for each run: one in the top zone, one in the center zone, and one in the bottom zone. The same 49 sites were measured on each wafer. From those 49 sites, the mean and the variance or standard deviation across the wafer were calculated.

Summarizing the data over the wafers, the engineers were able to model all two-factor interactions for four main effects, yielding the results in Table 3.

Sorted Parameter Estimates, within wafer var
Term | Estimate | Std Error | t Ratio | Prob>|t|
TPSI(30,150) (*)
BTSI(0,150)
TP-off*TPSI
TP-off*STB-TIM
TP-off*BTSI
TPSI*BTSI
TP-off(-5,0)
TPSI*STB-TIM
STB-TIM(15,45)
BTSI*STB-TIM

Sorted Parameter Estimates, wafer-to-wafer var
Term | Estimate | Std Error | t Ratio | Prob>|t|
TP-off*BTSI (*)
TPSI(30,150)
BTSI(0,150)
TPSI*STB-TIM
TP-off*TPSI
TPSI*BTSI
TP-off*STB-TIM
STB-TIM(15,45)
TP-off(-5,0)
BTSI*STB-TIM
Table 3. Parameter Estimates for non-transformed variances

Considering that they were modeling variances, and considering that variances are rarely assumed to be normally distributed, it isn't surprising that these models yielded misleading results.

Looking at the Box-Cox Transformation graphs in Figure 2, we see that a transformation is again appropriate here, with a Lambda of 0.2.

[Figure 2. Box-Cox Trans, within wafer var; Box-Cox Trans, wafer-to-wafer var]

Applying the transformation to each wafer variance and rerunning the analysis yields the results seen in Table 4 below.

Sorted Parameter Estimates, within wafer transformed
Term | Estimate | Std Error | t Ratio | Prob>|t|
TPSI(30,150) (*)
BTSI(0,150) (*)
TP-off*TPSI
TPSI*BTSI
TP-off(-5,0)
TP-off*STB-TIM
TP-off*BTSI
BTSI*STB-TIM
STB-TIM(15,45)
TPSI*STB-TIM

Sorted Parameter Estimates, wafer-to-wafer var transformed
Term | Estimate | Std Error | t Ratio | Prob>|t|
TP-off*BTSI (*)
BTSI(0,150)
TP-off(-5,0)
STB-TIM(15,45)
TPSI(30,150)
BTSI*STB-TIM
TP-off*TPSI
TP-off*STB-TIM
TPSI*BTSI
TPSI*STB-TIM
Table 4. Parameter estimates for transformed variances.

In both of these resulting sets of parameter estimates, we see that while the significance and magnitude of many of the effects haven't changed much, more than half of the parameter estimates have changed sign.

Example 3 (Kidney Transplant wait times):

In the previous two examples a Box-Cox transformation was used to transform the dependent variable to meet the assumption of normality or homogeneity of variance. The square root transformation is commonly used to transform the dependent variable, especially for count data. However, another approach is to estimate the model using a nonnormal distribution. The following example demonstrates modeling count data with a generalized linear model using JMP/SAS Integration rather than using a Box-Cox transformation.

The data for this example were modified from data of the US Department of Health and Human Services, Organ Procurement and Transplantation Network (OPTN), to reflect the number of days a patient was on the waiting list for a kidney donation. The researchers have patient data from 11 regions in the United States. When a patient received a kidney transplant, their blood type, region, and number of days waiting were recorded. The researchers wanted to model the number of days a person can expect to be on the kidney transplant list. In addition to predicting the number of days waiting, there is interest in determining whether either blood type or region has an impact on the number of days a patient will have to wait for a kidney transplant.

This section provides a brief review of the characteristics of the Poisson and Negative Binomial (or Gamma Poisson) models that are commonly used to model count data.

The Poisson distribution is often used to model counts of events in a fixed interval, for example the number of call arrivals in a cellular network. We can use some empirical checks to judge whether a Poisson distribution is a reasonable choice:

* The simplest and handiest check is to see whether the variance is roughly equal to the mean, as it should be for Poisson data.
* A histogram of Poisson data should be skewed right, though the skewness becomes less pronounced as the mean increases.

Poisson regression makes the very strong assumption that the conditional variance equals the conditional mean. Poisson regression is often used as a starting point for modeling count data, and it has many extensions.

The Gamma Poisson or Negative Binomial distribution can be used for over-dispersed count data, that is, when the conditional variance exceeds the conditional mean. It can be considered a generalization of Poisson regression, since it has the same mean structure as Poisson regression plus an extra parameter to model the over-dispersion. The Gamma Poisson(λ, σ) is equivalent to a Negative Binomial(r, p), where σ = (1 - p)/p + 1, if the estimate for r = λ/(σ - 1) is an integer.

Let's look at the example data and analyze them using a generalized linear model with both a Poisson and a Negative Binomial distribution to model the number of days a patient waited to receive a kidney donation.

If you examine the results from JMP's Distribution platform, it is easy to see that the Poisson distribution is not a very good choice, since the mean and variance differ by several orders of magnitude. In addition, the histogram of wait time does not have the characteristic shape of the Poisson mass function.

[Figure 3. Wait time histogram with fitted Poisson and GammaPoisson distributions]
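The mean-versus-variance check and the Gamma Poisson to Negative Binomial conversion quoted above are simple enough to sketch in code. In this Python illustration the function names and both data sets are hypothetical (the wait-time values are invented, not the OPTN data), and the conversion simply rearranges the two relations given in the text, σ = (1 - p)/p + 1 and r = λ/(σ - 1).

```python
import statistics

def dispersion_index(counts):
    """Return (mean, sample variance, variance / mean); near 1 suggests Poisson."""
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)
    return mean, variance, variance / mean

def gamma_poisson_to_negbin(lam, sigma):
    """Convert Gamma Poisson (lam, sigma) to Negative Binomial (r, p), sigma > 1."""
    p = 1.0 / sigma              # from sigma = (1 - p) / p + 1
    r = lam / (sigma - 1.0)      # from r = lam / (sigma - 1)
    return r, p

# Hypothetical counts: one set with variance close to the mean, one far from it.
roughly_poisson = [2, 3, 1, 4, 2, 3, 6, 2, 0, 3]
hypothetical_wait_days = [12, 480, 95, 1030, 7, 260, 640, 33, 1500, 210]

for label, data in (("roughly Poisson", roughly_poisson),
                    ("hypothetical wait days", hypothetical_wait_days)):
    mean, variance, index = dispersion_index(data)
    print(f"{label}: mean={mean:.1f}, variance={variance:.1f}, ratio={index:.1f}")

r, p = gamma_poisson_to_negbin(lam=400.0, sigma=500.0)
print("r =", r, "p =", p, "implied mean =", r * (1.0 - p) / p)
```

A variance-to-mean ratio in the hundreds, as in the second data set, is the kind of gap of "several orders of magnitude" that rules out a Poisson fit. The implied Negative Binomial mean r(1 - p)/p recovers λ, which is a quick consistency check on the conversion.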


Furthermore, you can estimate a Poisson model using JMP's Fit Model platform by specifying Personality: Generalized Linear Model; Distribution: Poisson; Link: Log; and checking the box to request Overdispersion Tests and Intervals.

[Figure 4.]

The large value for the Overdispersion, the ratio of the Pearson chi-square statistic to its degrees of freedom, is indicative of a model shortcoming. The data are considerably more dispersed than is expected under a Poisson model. Possible reasons for this overdispersion include a misspecified mean model, data that might not be Poisson distributed, an incorrect variance function, or correlations among the observations.

Whole Model Test
Model | -LogLikelihood | L-R ChiSquare | DF | Prob>ChiSq
Difference (Prob>ChiSq <.0001*)
Full
Reduced

Goodness Of Fit Statistic | ChiSquare | DF | Prob>ChiSq | Overdispersion
Pearson (Prob>ChiSq <.0001*)
Deviance (Prob>ChiSq <.0001*)

AICc
Table 5. JMP results from Poisson model.

Given that the conditional variance exceeds the conditional mean and that there is evidence the data are overdispersed, the researchers decided to estimate a generalized linear model with a negative binomial distribution, using the GLIMMIX procedure through the JMP/SAS integration capabilities.
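The overdispersion ratio described above, Pearson chi-square divided by its degrees of freedom, can be computed by hand for a simple case. The Python sketch below assumes a Poisson model with a single categorical factor and a log link, for which the maximum likelihood fitted mean of each observation is its group mean. The function name and the data are hypothetical; this is not the paper's model, which also includes region and the interaction.

```python
def pearson_overdispersion(groups):
    """Pearson chi-square, its DF, and their ratio for a one-factor Poisson model.

    groups maps a factor level to the list of counts observed at that level.
    """
    chi_square = 0.0
    n_obs = 0
    for counts in groups.values():
        fitted = sum(counts) / len(counts)  # fitted Poisson mean = group mean
        chi_square += sum((y - fitted) ** 2 / fitted for y in counts)
        n_obs += len(counts)
    df = n_obs - len(groups)                # observations minus estimated means
    return chi_square, df, chi_square / df

# Hypothetical wait times (days) for two blood types.
hypothetical_waits = {
    "A": [120, 300, 45, 610],
    "O": [200, 900, 80, 1500],
}

chi_square, df, ratio = pearson_overdispersion(hypothetical_waits)
print(f"Pearson chi-square = {chi_square:.1f}, DF = {df}, ratio = {ratio:.1f}")
```

Under a well-fitting Poisson model the ratio should be close to 1. A ratio in the hundreds, as here, is the signal that prompts a move to a negative binomial model.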

To access SAS procedures using JMP/SAS integration, simply insert a SAS Connect() call and wrap your SAS code in a SAS Submit() call, open the script in a JMP scripting window, then click Run (or launch the JMP script). Alternatively, once the server connections have been defined, you can right-click and select Submit to SAS (local) or SAS (server).

The model is fit initially with the following PROC GLIMMIX statements, which estimate a negative binomial generalized linear model with effects for an intercept, blood type, region, and their interaction:

    SAS Connect();
    SAS Submit("
        PROC GLIMMIX;
            FREQ Freq;
            CLASS Blood_type Region;
            MODEL Wait_time = Blood_type Region Region*Blood_type / dist=negbin s;
        RUN;
    ");

The Model Information table shows that the parameters in this Negative Binomial model are estimated by maximum likelihood. The default link and variance functions are used. The Class Level Information table shows the nominal variables with their respective numbers of levels and values.

The GLIMMIX Procedure

Model Information
Data Set: WORK.KIDNEY_TRANSPLANT_WAITTIMES_STAC
Response Variable: Wait_time
Response Distribution: Negative Binomial
Link Function: Log
Variance Function: Default
Frequency Variable: Freq
Variance Matrix: Diagonal
Estimation Technique: Maximum Likelihood
Degrees of Freedom Method: Residual

Class Level Information
Class | Levels | Values
Blood_type | 4 | A AB B O
Region

Number of Observations Read: 352
Number of Observations Used: 352
Sum of Frequencies Read
Sum of Frequencies Used
Table 6. SAS Model and Class Level Information.

The following iteration history shows that the model converges.

Iteration History
Iteration | Restarts | Evaluations | Objective Function | Change | Max Gradient
Convergence criterion (GCONV=1E-8) satisfied
Table 7. SAS Iteration History.

The Fit Statistics table lists goodness-of-fit statistics. For a generalized linear model, if the Pearson Chi-Square / DF value deviates much from 1 (above or below), then you will need to adjust for overdispersion or underdispersion.

Fit Statistics
-2 Log Likelihood
AIC (smaller is better)
AICC (smaller is better)
BIC (smaller is better)
CAIC (smaller is better)
HQIC (smaller is better)
Pearson Chi-Square
Pearson Chi-Square / DF: 0.85
Table 8. SAS Fit Statistics.

The SCALE parameter is included in the parameter estimates for generalized linear models in the GLIMMIX procedure. The scale parameter is the estimate of the dispersion parameter. If the scale parameter equals zero, the model reduces to the simpler Poisson model; if dispersion is greater than zero, the response variable is over-dispersed; and if dispersion is less than zero, the response variable is under-dispersed. In this example, the SCALE parameter value was [...].

Keep in mind that the Estimates are the estimated negative binomial regression coefficients for the model. Recall that the dependent variable is a count variable and that the regression models the log of the expected count as a linear function of the predictor variables. If a variable is not in the CLASS statement, we can interpret its regression coefficient as follows: for a one-unit change in the predictor variable, the log of the expected count of the response variable changes by the respective regression coefficient, given that the other predictor variables in the model are held constant. Because PROC GLIMMIX uses a set-to-zero solution for the class effects, each class level is essentially compared to the last class level. Therefore, the solutions are interpreted as follows: the coefficient for AB is an estimate of log(p for AB) - log(p for O), and so on.

Parameter Estimates
Effect | Blood_type | Region | Estimate | Standard Error | DF | t Value | Pr > |t|
Intercept (Pr > |t| <.0001)
Blood_type A (Pr > |t| <.0001)
Blood_type AB
Blood_type B
Blood_type O
Region: 11 rows, one per region; the fifth and eighth rows listed show Pr > |t| <.0001
Blood_type*Region: 44 rows, 11 regions for each of blood types A, AB, B, and O
Scale
Table 9. SAS Parameter Estimates.
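To make the interpretation of a log-link, set-to-zero solution concrete, the following Python sketch converts coefficients into expected counts and ratios. Every number in it is hypothetical and chosen only for illustration; the sketch ignores region and the interaction, and the variance function mu + k * mu**2 is the common negative binomial form, assumed here to correspond to the scale parameter described above.

```python
import math

# Hypothetical coefficients on the log scale; "O" is the reference level (set to zero).
intercept = 6.0
blood_type_effect = {"A": -0.30, "AB": -0.90, "B": -0.10, "O": 0.0}
scale_k = 0.8  # hypothetical dispersion (scale) estimate

def expected_count(blood_type):
    """Expected wait (days): exp of the linear predictor under a log link."""
    return math.exp(intercept + blood_type_effect[blood_type])

def ratio_vs_reference(blood_type):
    """Multiplicative change in expected count relative to the reference level."""
    return math.exp(blood_type_effect[blood_type])

def negbin_variance(mu, k):
    """Negative binomial variance mu + k * mu**2; k = 0 gives the Poisson variance."""
    return mu + k * mu ** 2

for bt in ("A", "AB", "B", "O"):
    mu = expected_count(bt)
    print(f"{bt}: expected count={mu:.1f}, ratio vs O={ratio_vs_reference(bt):.2f}, "
          f"variance={negbin_variance(mu, scale_k):.0f}")
```

With these made-up values, a coefficient of -0.90 for AB means the expected wait for AB is exp(-0.90), roughly 41 percent, of the expected wait for the reference type O, which is how the difference of logs in the text turns into a ratio on the original scale.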

Type III Tests of Fixed Effects
Effect | Num DF | Den DF | F Value | Pr > F
Blood_type (Pr > F <.0001)
Region (Pr > F <.0001)
Blood_type*Region (Pr > F <.0001)
Table 10. SAS Type III Tests of Fixed Effects.

The "Type III Tests of Fixed Effects" table indicates a significant blood_type*region interaction. We can use the SLICE option in the LSMEANS statement to examine the simple effects and to determine which levels of blood_type within the 11 regions are contributing to the interaction. The SLICEDIFF option provides multiple comparison tests of the simple effects. The PLOT=MEANPLOT(SLICEBY=BLOOD_TYPE) option generates the graphical display of the simple effects of blood_type within the 11 regions.

    lsmeans region*blood_type / slice=blood_type slicediff=blood_type
            plot=meanplot(sliceby=blood_type join);
    RUN;

Both the mean plot and the results of the Tests of Effect Slices for Blood_type*Region by blood_type indicate that the simple effects of blood_type differ across the 11 regions. The following tables subset the pairwise comparisons of the simple effects of regions within blood_type to only those that were significant.

[Figure 5. LS Means plot showing lines for blood type at each region]

The Least Squares Means plot indicates that patients with blood_type AB have a shorter wait_time in 9 of the regions. It also shows that region 6 has a lower wait time than all other regions for all but blood type A.

[Figure 6. States within Regions]

Summary and Conclusion

The goal of this paper was to introduce transformations to researchers as a technique to normalize variables or equalize variance. The Box-Cox transformation takes the idea of having a range of power transformations available (rather than only the classic square root, log, and inverse) to improve the efficacy of normalizing and variance equalizing for both positively and negatively skewed variables. As the two examples presented above show, not only does Box-Cox easily normalize skewed data, but normalizing the data can also have a dramatic impact on the results of analyses. Further, both JMP and SAS incorporate powerful Box-Cox transformations. However, using JMP/SAS Integration Technology allows you to use PROC GLIMMIX to model many non-normal distributions rather than using a data transformation.

Data transformations can introduce complexity into the substantive interpretation of the results, because they change the nature of the variable. In addition, a simple power transformation with λ less than 0.00 reverses the order of the data (the Box-Cox form, which divides by λ, preserves the order), so care should be taken when interpreting results. Sakia (1992) briefly reviews the arguments revolving around this issue, as well as techniques for utilizing variables that have been power transformed in prediction or for converting results back to the original metric of the variable. For example, Taylor (1986) describes a method of approximating the results of an analysis following transformation, and others (see Sakia, 1992) have shown that this seems to be a relatively good solution in most cases. Given the potential benefits of utilizing transformations (e.g., meeting assumptions of analyses, improving generalizability of the results, improving effect sizes), the drawbacks do not seem compelling in the age of modern computing.

References

Bortkiewicz, L. von. (1898). Das Gesetz der kleinen Zahlen. Leipzig: G. Teubner.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, B, 26.
Hilbe, J. M. (2007). Negative Binomial Regression. Cambridge, UK: Cambridge University Press.
Sakia, R. M. (1992). The Box-Cox transformation technique: A review. The Statistician, 41.
Taylor, M. J. G. (1986). The retransformed mean after a fitted power transformation. Journal of the American Statistical Association, 81.
Tukey, J. W. (1957). The comparative anatomy of transformations. Annals of Mathematical Statistics, 28.
Zimmerman, D. W. (1995). Increasing the power of nonparametric tests by detecting and downweighting outliers. Journal of Experimental Education, 64(1).

Contact Information

Your comments and questions are valued and encouraged. Contact the authors at:

Kathleen Kiernan, Diane Michelson, Annie Dudley Zangi
SAS Institute, JMP Division
Cary, NC
Kathleen.Kiernan@sas.com
Di.Michelson@sas.com
Annie.Zangi@jmp.com
