PARTICLE COUNTS MISLEAD SCIENTISTS: A JMP AND SAS INTEGRATION CASE STUDY INVESTIGATING MODELING

Size: px
Start display at page:

Download "PARTICLE COUNTS MISLEAD SCIENTISTS: A JMP AND SAS INTEGRATION CASE STUDY INVESTIGATING MODELING"

Transcription

1 PARTICLE COUNTS MISLEAD SCIENTISTS: A JMP AND SAS INTEGRATION CASE STUDY INVESTIGATING MODELING Kathleen Kiernan, Diane Michelson, PhD, Annie Zangi, SAS Many researchers and statisticians deal with data that do not meet the assumptions of normality or homogeneity of variance. Research has shown parametric tests such as regression or ANOVA can be robust to modest violations of these assumptions. Data transformations are commonly used tools that can provide methods to improve normality of a distribution and equalize variance to meet assumptions. There are many potential types of data transformations. Some of the more traditional types and common types of data transformations include square root, logarithmic, natural log, and inverse transformations for improving normality. However, the Box-Cox transformation (1964) represents a family of power transformations that include some of the traditional transformations to help researchers find the optimal normalizing transformation for each variable. That said, engineers and researchers find the Box-Cox transformation represents a good technique where normalizing data or equalizing variance is desired. This paper will provide an overview of the Box-Cox transformation and address specific questions relating to using the Box-Cox transformation in statistical analysis. Examples of applications of the Box-Cox transformation are presented including details how to obtain a Box-Cox transformation using JMP/SAS Integration Technology. Many papers and research has been done since the original Box-Cox transformation paper (1964) that provides a family of transformations that will optimally normalize a particular variable, eliminating the need to randomly try different transformations to determine the best option. Perhaps Box and Cox (1964) originally envisioned this transformation as a solution for simultaneously correcting normality, linearity, and homoscedasticity. While these transformations often improve all of these aspects of a distribution or analysis, Sakia (1992) and others have noted it does not always accomplish these challenging goals. Many statistical procedures make two assumptions that are relevant to this topic: (a) an assumption that the variables (or the errors (residuals), more technically correct) are normally distributed, and (b) an assumption of homoscedasticity or homogeneity of variance, meaning that the variance of the variable remains constant over the observed range of some other variable. In regression analyses this second assumption is that the variance around the regression line is constant across the entire observed range of data. In ANOVA analyses, this assumption is that the variance in one cell is not significantly different from that of other cells. Most statistical software packages provide ways to test both assumptions. Significant violation of either assumption can increase your chances of committing either a Type I or II error (depending on the nature of the analysis and violation of the assumption). Yet few researchers test these assumptions, and fewer still report correcting for violation of these assumptions. This is unfortunate, given that in most cases it is relatively simple to correct this problem through the application of data transformations or fit statistical models to data with correlations or nonconstant variability and where the

2 response is not necessarily normally distributed. Even when one is using analyses considered robust to violations of these assumptions or non-parametric tests (that do not explicitly assume normally distributed error terms), attending to these issues can improve the results of the analyses (e.g., Zimmerman, 1995). How does one test assumption of normality and homogeneity of variance? The following information taken from the UNIVARIATE chapter of the SAS Procedures Guide explains how to interpret the results of the NORMAL option tests: You determine whether to reject the null hypothesis by examining the probability that is associated with a test statistic. When the p-value is less than the predetermined critical value (alpha value), you reject the null hypothesis and conclude that the data does not came from the theoretical distribution. If you want to test the normality assumptions that underlie analysis of variance methods, beware of using a statistical test for normality alone. A test's ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Since small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option, the HISTOGRAM statement, PROBPLOT statement, and QQPLOT statement can be very helpful. For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test's ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20, rather than the often-used 0.05 level. Again, consulting plots and additional statistics will help you assess the severity of the deviations from normality. Traditional data transformations used to improve normality This particular section of the paper will briefly highlight some points regarding some of the commonly used transformations. Square root transformations are traditionally thought of as good for normalizing Poisson distributions (most common with data that are counts of occurrences, such as number of times an insurance claim is filed for specific age group and type of car or the famous example of the number of soldiers in the Prussian Cavalry killed by horse kicks each year (Bortkiewicz, 1898) and equalizing variance. Log transformations are actually a class of transformations, rather than a single transformation, and in many fields of science log-normal variables (i.e., normally distributed after log transformation) are relatively common. Log-normal variables seem to be more common when outcomes are influenced by many independent factors (e.g., biological outcomes).

3 The Arcsine transformation has traditionally been used for proportions, (which range from 0.00 to 1.00), and involves of taking the arcsine of the square root of a number, with the resulting transformed data reported in radians. Because of the mathematical properties of this transformation, the variable must be transformed to the range 1.00 to While a perfectly valid transformation, other modern techniques may limit the need for this transformation. For example, rather than aggregating original binary outcome data to a proportion, researchers can use logistic regression on the original data. Box- Cox transformation. You may have noticed that many potential transformations, including several discussed above, are all members of a class of transformations called power transformations. Power transformations are merely transformations that raise numbers to an exponent (power). For example, a square root transformation can be characterized as x 1/2, inverse transformations can be characterized as x -1 and so forth. Tukey (1957) is often credited with presenting the initial idea that transformations can be thought of as a class or family of similar mathematical functions. This idea was modified by Box and Cox (1964) and is used to find potentially nonlinear transformations of a dependent variable. The Box-Cox transformation has the form y λ-1 / λ λ 0 log(y) λ=0 This family of transformations of the positive dependent variable y is controlled by the parameter λ. Transformations linearly related to square root, inverse, quadratic, cubic, and so on are all special cases. The limit as λ approaches 0 is the log transformation. More generally, Box-Cox transformations of the following form can be fit: ( (y+c) λ - 1 ) / (λg) λ 0 log(y+c)/g λ=0 By default, c=0. The parameter c can be used to rescale y so that it is strictly positive. By default, g=1. Alternatively, g can be y λ-1, where y is the geometric mean of y. Both JMP and SAS estimate lambda λ, the Box-Cox transformation coefficient using a maximum likelihood approach and to estimate the effects of a selected range of λ automatically. Given that λ can take on an almost infinite number of values, we can theoretically calibrate a transformation to be maximally effective in moving a variable toward normality, regardless of whether it is negatively or positively skewed. Additionally, as mentioned above, this family of transformations incorporates many traditional transformations: λ = 1.00: no transformation needed; produces results identical to original data λ = 0.50: square root transformation λ = 0.33: cube root transformation λ = 0.25: fourth root transformation

4 λ = 0.00: natural log transformation λ = -0.50: reciprocal square root transformation λ = -1.00: reciprocal (inverse) transformation and so forth Example 1 (Spin Rinse Dryer Optimization Experiment): In this example, engineers were trying to determine process settings to minimize delta particles. Wafers are loaded into a spin rinse dryer and processing begins. Deionized (DI) water is sprayed on the wafers while they are spun for a certain length of time, at a certain spin speed. Wafers are then dried by spraying with low pressure nitrogen. Particle counts are measured on each wafer before and after the clean on two wafers, and the average is reported. The factors being considered are: rinse time, spin speed (lrpm), drying time, and nitrogen pressure blast. As the design was a full-factorial, they were able to model all two-factor interactions and main effects, which yielded the results in Table 1. Sorted Parameter Estimates Term Estimate Std Error t Ratio Prob> t n2press(10,60) <.0001* drytime(60,300) lrpm*n2press lrpm(2,2.903) lrpm*drytime drytime*n2press rinsetime*lrpm rinsetime*drytime rinsetime*n2press rinsetime(60,300) Table 1. Problem: These results caused the engineers to jump to invalid conclusions. Because both dry time, and nitrogen pressure were significant positive effects, and the interaction was also positive, they changed the process to be centered around higher values of both time and pressure. However, the new results weren t improved. A closer look: One option in Fit Model to evaluate the fit and error structure is the Box Cox Transformation. When we select this option, we see the graph in Figure 1 below. In this graph, we see a transformation would indeed be appropriate here. The graph is a plot of SSE vs lambda for various values of lambda. The model with the lowest SSE is judged best. The red line gives a confidence limit on the lambdas. Models with SSE below the red line are all the same. Lambda=1 indicates no transformation necessary. Lambda=0 indicates log transformation. Lambda=0.5 indicates square root transformation. If the SSE corresponding to Lambda=1 is above the red line, and the best is below the red line, then it will help to transform the data

5 Box-Cox Transformations Figure 1. On selecting the option, Save Best Transform, we see the recommended transform to be Y = ( (delta particles) 0.2 1) / Rerunning the model using the new transform yields the results seen in Figure 3 below. Sorted Parameter Estimates Term Estimate Std Error t Ratio Prob> t n2press(10,60) <.0001* drytime(60,300) * lrpm(2,2.903) lrpm*n2press rinsetime*lrpm rinsetime*drytime drytime*n2press lrpm*drytime rinsetime*n2press rinsetime(60,300) Table 2. In the new analysis, we see all the parameter estimates have changed. In fact the interaction for the two significant main effects, Nitrogen pressure * Drying time, has changed signs. Conclusion: The assumption that the error structure is Normally i.i.d. is often forgotten, but still important. Failing to confirm this can lead to misleading results. Example 2: (Amorphous Silicon Screening Experiment) In this example, engineers were trying to determine process settings to minimize wafer variability in an amorphous silicon deposition process, both within the wafers and between wafers. The amorphous silicon deposition process consists of loading the wafers into the vertical furnace, regulating the temperature at the top zone, flowing the silicon source, in this case, silane, from top to bottom, with an optional flow at the bottom, and controlling the stabilization time. Data was taken on 20 runs. Three wafers were measured for each run, one in the top zone, one in the center zone, and one in the bottom zone. The same 49 sites were measured on each wafer. From those 49 sites, the mean and variance or standard deviation across the wafer was calculated.

6 Summarizing the data over the wafers, the engineers were able to model all two-factor interactions for four main effects, yielding the results in Table 3. Sorted Parameter Estimates, within wafer var Term Estimate Std Error t Ratio Prob> t TPSI(30,150) * BTSI(0,150) TP-off*TPSI TP-off*STB-TIM TP-off*BTSI TPSI*BTSI TP-off(-5,0) TPSI*STB-TIM STB-TIM(15,45) BTSI*STB-TIM Sorted Parameter Estimates, wafer-to-wafer var Term Estimate Std Error t Ratio Prob> t TP-off*BTSI * TPSI(30,150) BTSI(0,150) TPSI*STB-TIM TP-off*TPSI TPSI*BTSI TP-off*STB-TIM STB-TIM(15,45) TP-off(-5,0) BTSI*STB-TIM Table 3. Parameter Estimates for non-transformed variances Considering that they were modeling variances, and considering that variances are rarely assumed to be normally distributed, it isn t surprising that the results from these models yielded incorrect results. Looking at the Box-Cox Transformation graphs in Figure 2, we see that a transformation is again appropriate here, with a Lambda of 0.2. Box-Cox Trans, within wafer var Box-Cox Trans, wafer-to-wafer var Figure 2. Applying the transformation to each wafer variance, and rerunning the analysis yields the results seen in Table 4 below.

7 Sorted Parameter Estimates, within wafer transformed Term Estimate Std Error t Ratio Prob> t TPSI(30,150) * BTSI(0,150) * TP-off*TPSI TPSI*BTSI TP-off(-5,0) TP-off*STB-TIM TP-off*BTSI BTSI*STB-TIM STB-TIM(15,45) TPSI*STB-TIM Sorted Parameter Estimates, wafer-to-wafer var transformed Term Estimate Std Error t Ratio Prob> t TP-off*BTSI * BTSI(0,150) TP-off(-5,0) STB-TIM(15,45) TPSI(30,150) BTSI*STB-TIM TP-off*TPSI TP-off*STB-TIM TPSI*BTSI TPSI*STB-TIM Table 4. Parameter estimates for transformed variances. In both of these resulting sets of parameter estimates, we see while the significance and magnitude of many of the effects hasn t changed much, that more than half of the parameter estimates have changed signs. Example 3: (Kidney Transplant wait times) In the previous two examples a Box-Cox transformation was used to transform the dependent variable to meet the assumption of normality or homogeneity. The square root transformation is commonly used to transform the dependent variable especially for count data. However, another approach would be to estimate the model using a nonnormal distribution. The following data example will demonstrate modeling count data with a generalized linear model using JMP/SAS Integration rather than using a Box-Cox transformation. The data for this example was modified from the US Department of Health and Human Services, Organ Procurement and Transplantation Network (OPTN) to reflect the number of days a patient was on the waiting list for a kidney donation. The researchers have patient data from 11 regions in the United States. When a patient received a kidney transplant their blood type, region, and number of days waiting was recorded. Some researchers wanted to model the number of days a person can expect to be on the kidney transplant list. In addition to predicting the number of days waiting, there is interest in determining if either blood type or region has an impact on the number of days a patient will have to wait for a kidney transplant.

8 This section will provide a brief review of the characteristics of Poisson and Negative Binomial or Gamma-Poisson models that are commonly used to model count data. The Poisson distribution is often used to model wait times, for example the number of call arrivals in a cellular network. Furthermore, we can use some empirical tests as ways of checking for a particular distribution to model. *The simplest and handiest way is to see if the variance is roughly equal to the mean for Poisson data. *A histogram of the Poisson data should be skewed right, though the skewness becomes less pronounced as the mean increases. Poisson regression has a very strong assumption that the conditional variance equals the conditional mean. Poisson regression is often used as a starting point for modeling count data and Poisson regression has many extensions. The Gamma Poisson or Negative binomial distribution can be used for over-dispersed count data, that is, when the conditional variance exceeds the conditional mean. It can be considered as a generalization of Poisson regression since it has the same mean structure as Poisson regression and it has an extra parameter to model the over-dispersion. The Gamma Poisson(λ, σ) is equivalent to a Negative Binomial(r, p), where σ = ((1-p)/p + 1) if the estimate for r = λ/(σ -1) is an integer. Let s look at this particular example data and analysis using a generalized linear model with both a Poisson and Negative Binomial distribution to model the number of days a patient waited to receive a kidney donation. If you examine the results from JMP s Distribution platform it is easy to see that the Poisson distribution is not a very good choice since the mean and variance differ by several orders of magnitude. In addition, the histogram of wait time is not the characteristic shape for the mass function of the Poisson distribution. Wait time Poisson( ) GammaPoisson( , ) Figure 3.

9 Furthermore, you can estimate a Poisson model using JMP s Fit Model platform by defining the Personality: Generalized Linear Model; Distribution: Poisson; Link: Log; and check the box to request Overdispersion tests and Intervals. Figure 4. The large value, , for the Overdispersion or ratio of the Pearson chi-square statistic and its degrees of freedom is indicative of a model shortcoming. The data are considerably more dispersed than is expected under a Poisson model. Possible reasons for this overdispersion include a misspecified mean model, data that might not be Poisson distributed, incorrect variance function, or correlations among the observations. Whole Model Test Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq Difference <.0001* Full Reduced Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq Overdispersion Pearson <.0001* Deviance <.0001* AICc Table 5. JMP results from Poisson model. Given that the conditional variance exceeds the conditional mean and there is potential evidence that the data might be overdispersed, the researchers therefore decided to pursue estimating a generalized linear model with a negative binomial distribution using the GLIMMIX procedure using the JMP/SAS integration capabilities.

10 To access SAS procedures using JMP/SAS integration, simply insert a SAS Connect() call and add a SAS Submit( ) around your SAS code, open it in a JMP scripting window, then click Run (or launch the JMP script). Alternatively, once the server connections have been defined you can right-click and select submit to SAS (local) or SAS(server). The model is fit initially with the following PROC GLIMMIX statements to estimate a negative binomial generalized linear model with effects for an intercept, blood type, region, and their interaction: SAS Connect(); SAS Submit(" PROC GLIMMIX; FREQ Freq; CLASS Blood_type Region; MODEL Wait_time = Blood_type Region Region*Blood_type/dist=negbin s ; RUN; ") The Model Information table shows that the parameters in this Negative Binomial model are estimated by maximum likelihood. The default link and variance function is used. The Class Level Information shows the nominal variables with their respective number of levels and values. The GLIMMIX Procedure Data Set Model Information WORK.KIDNEY_TRANSPLANT_WAITTIMES_STAC Response Variable Wait_time Response Distribution Negative Binomial Link Function Log Variance Function Default Frequency Variable Freq Variance Matrix Diagonal Estimation Technique Maximum Likelihood Degrees of Freedom Method Residual Class Level Information Class Levels Values Blood_type 4 A AB B O Region Number of Observations Read 352 Number of Observations Used 352 Sum of Frequencies Read Sum of Frequencies Used Table 6. SAS Model and Class Level Information.

11 The following iteration history shows the model converges. Iteration Restarts Evaluations Iteration History Objective Function Change Max Gradient Convergence criterion (GCONV=1E-8) satisfied Table 7. SAS Iteration History. The Fit Statistics table indicates a list of goodness of fit statistics. For a generalized linear model if the Pearson Chi-Square /DF value deviates much from 1 (above or below) then you will need to adjust for overdispersion or underdispersion. Fit Statistics -2 Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) CAIC (smaller is better) HQIC (smaller is better) Pearson Chi-Square Pearson Chi-Square / DF 0.85 Table 8. SAS Fit Statistics. The SCALE parameter is included in the parameter estimates for generalized linear models in the GLIMMIX procedure. The scale parameter is the estimate of the dispersion parameter. If the scale parameter equals zero, the model reduces to the simpler Poisson model; if dispersion is greater than zero, the response variable is over-dispersed; and if dispersion is less than zero, the response variable is under-dispersed. In this example, the SCALE parameter value was Keep in mind the Estimates are the estimated negative binomial regression coefficients for the model. Recall that the dependent variable is a count variable and the regression models the log of the expected count as a linear function of the predictor variables. If the variable is not on the class statement, we can interpret each regression coefficient as follows: for a one unit change in the predictor variable, the difference in the logs of expected counts of the response variable is expected to change by the respective regression coefficient, given the other predictor variables in the model are held constant. Because PROC GLIMMIX defines a set to zero solution for the class effects, each class level is

12 essentially compared to the last class level. Therefore, the solutions are interpreted as the coefficient for AB is an estimate of log(p for AB) - log(p for O), etc. Parameter Estimates Blood_ Standard Effect type Region Estimate Error DF t Value Pr > t Intercept <.0001 Blood_type A <.0001 Blood_type AB Blood_type B Blood_type O Region Region Region Region Region <.0001 Region Region Region <.0001 Region Region Region Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region A Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region AB Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region B Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Blood_type*Region O Scale Table 9. SAS Parameter Estimates.

13 Type III Tests of Fixed Effects Effect Num DF Den DF F Value Pr > F Blood_type <.0001 Region <.0001 Blood_type*Region <.0001 Table 10. SAS Type III Tests of Fixed Effects. The "Type III Tests of Fixed Effects" table indicates a significant interaction between blood_type*region. We can use the SLICE option in the LSMEANS statement to examine the simple effects and to determine which level of blood_type within the 11 regions are contributing the interaction. The SLICEDIFF option will provide multiple comparisons tests of the simple effects. The PLOT=MEANPLOT(SLICEBY=BLOOD_TYPE) generates the graphical display of the simple effects of blood_type within the 11 regions. lsmeans region*blood_type/slice=blood_type slicediff=blood_type plot=meanplot(sliceby=blood_type join); RUN; Both the Meanplot and the results of Tests of Effect Slices for Blood_type*Region by blood_type indicate the simple effect of blood_type differ across the 11 regions. The following tables subset the pairwise comparisons of the simple effects of regions within blood_type to only those that were significant. Figure 5. LS Means plot showing lines for blood type at each region

14 The Least Squares Means plot indicates patients with blood_type AB have a shorter wait_time in 9 of the regions. Also, it shows that region 6 has an overall lower wait time than all other regions for all but blood type A. Figure 6. States within Regions Summary and Conclusion The goal of this paper was to introduce transformations to researchers as a technique to normalize variable or equalize variance. The Box-Cox transformation takes the idea of having a range of power transformations (rather than the classic square root, log, and inverse) available to improve the efficacy of normalizing and variance equalizing for both positively- and negatively-skewed variables. As the two examples presented above show, not only does Box-Cox easily normalize skewed data, but normalizing the data also can have a dramatic impact on the results in analyses. Further, both JMP and SAS incorporate powerful Box-Cox transformations. However, using JMP/SAS Integration Technology allows you to use PROC GLIMMIX to model many non-normal distributions rather than using a data transformation. Data transformations can introduce complexity into substantive interpretation of the results (as they change the nature of the variable, and any λ less than 0.00 has the effect of reversing the order of the data, and thus care should be taken when interpreting results.). Sakia (1992) briefly reviews the arguments revolving around this issue, as well as techniques for utilizing variables that have been power transformed in prediction or converting results back to the original metric of the variable. For example, Taylor (1986) describes a method of approximating the results of an analysis following

15 transformation, and others (see Sakia, 1992) have shown that this seems to be a relatively good solution in most cases. Given the potential benefits of utilizing transformations (e.g., meeting assumptions of analyses, improving generalizability of the results, improving effect sizes) the drawbacks do not seem compelling in the age of modern computing. References Bortkiewicz, L., von. (1898). Das Gesetz der kleinen Zahlen.Leipzig: G. Teubner. Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, B, 26( ). Hilbe, Joseph M., Negative Binomial Regression, Cambridge, UK: Cambridge University Press (2007) Negative Binomial Regression Cambridge University Press Sakia, R. M. (1992). The Box-Cox transformation technique: A review. The Statistician, 41, Taylor, M. J. G. (1986). The retransformed mean after a fitted power transformation. Journal of the American Statistical Association, 81, Tukey, J. W. (1957). The comparative anatomy of transformations. Annals of Mathematical Statistics, 28, Zimmerman, D. W. (1995). Increasing the power of nonparametric tests by detecting and downweighting outliers. Journal of Experimental Education, 64(1), Contact Information Your comments and questions are valued and encouraged. Contact the authors at: Kathleen Kiernan Diane Michelson Annie Dudley Zangi SAS Institute, JMP Division Cary, NC Kathleen.Kiernan@sas.com Di.Michelson@sas.com Annie.Zangi@jmp.com

Topic 23: Diagnostics and Remedies

Topic 23: Diagnostics and Remedies Topic 23: Diagnostics and Remedies Outline Diagnostics residual checks ANOVA remedial measures Diagnostics Overview We will take the diagnostics and remedial measures that we learned for regression and

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Topic 20: Single Factor Analysis of Variance

Topic 20: Single Factor Analysis of Variance Topic 20: Single Factor Analysis of Variance Outline Single factor Analysis of Variance One set of treatments Cell means model Factor effects model Link to linear regression using indicator explanatory

More information

Analysis of Count Data A Business Perspective. George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013

Analysis of Count Data A Business Perspective. George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013 Analysis of Count Data A Business Perspective George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013 Overview Count data Methods Conclusions 2 Count data Count data Anything with

More information

dm'log;clear;output;clear'; options ps=512 ls=99 nocenter nodate nonumber nolabel FORMCHAR=" = -/\<>*"; ODS LISTING;

dm'log;clear;output;clear'; options ps=512 ls=99 nocenter nodate nonumber nolabel FORMCHAR= = -/\<>*; ODS LISTING; dm'log;clear;output;clear'; options ps=512 ls=99 nocenter nodate nonumber nolabel FORMCHAR=" ---- + ---+= -/\*"; ODS LISTING; *** Table 23.2 ********************************************; *** Moore, David

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

MODELING COUNT DATA Joseph M. Hilbe

MODELING COUNT DATA Joseph M. Hilbe MODELING COUNT DATA Joseph M. Hilbe Arizona State University Count models are a subset of discrete response regression models. Count data are distributed as non-negative integers, are intrinsically heteroskedastic,

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

STAT 5200 Handout #26. Generalized Linear Mixed Models

STAT 5200 Handout #26. Generalized Linear Mixed Models STAT 5200 Handout #26 Generalized Linear Mixed Models Up until now, we have assumed our error terms are normally distributed. What if normality is not realistic due to the nature of the data? (For example,

More information

PENALIZING YOUR MODELS

PENALIZING YOUR MODELS PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

More information

Outline. Topic 20 - Diagnostics and Remedies. Residuals. Overview. Diagnostics Plots Residual checks Formal Tests. STAT Fall 2013

Outline. Topic 20 - Diagnostics and Remedies. Residuals. Overview. Diagnostics Plots Residual checks Formal Tests. STAT Fall 2013 Topic 20 - Diagnostics and Remedies - Fall 2013 Diagnostics Plots Residual checks Formal Tests Remedial Measures Outline Topic 20 2 General assumptions Overview Normally distributed error terms Independent

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Final Exam. Name: Solution:

Final Exam. Name: Solution: Final Exam. Name: Instructions. Answer all questions on the exam. Open books, open notes, but no electronic devices. The first 13 problems are worth 5 points each. The rest are worth 1 point each. HW1.

More information

Diagnostics and Transformations Part 2

Diagnostics and Transformations Part 2 Diagnostics and Transformations Part 2 Bivariate Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University Multilevel Regression Modeling, 2009 Diagnostics

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/ Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/28.0018 Statistical Analysis in Ecology using R Linear Models/GLM Ing. Daniel Volařík, Ph.D. 13.

More information

Generalized linear models

Generalized linear models Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

4.1. Introduction: Comparing Means

4.1. Introduction: Comparing Means 4. Analysis of Variance (ANOVA) 4.1. Introduction: Comparing Means Consider the problem of testing H 0 : µ 1 = µ 2 against H 1 : µ 1 µ 2 in two independent samples of two different populations of possibly

More information

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown Nonparametric Statistics Leah Wright, Tyler Ross, Taylor Brown Before we get to nonparametric statistics, what are parametric statistics? These statistics estimate and test population means, while holding

More information

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1 Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson

More information

DISPLAYING THE POISSON REGRESSION ANALYSIS

DISPLAYING THE POISSON REGRESSION ANALYSIS Chapter 17 Poisson Regression Chapter Table of Contents DISPLAYING THE POISSON REGRESSION ANALYSIS...264 ModelInformation...269 SummaryofFit...269 AnalysisofDeviance...269 TypeIII(Wald)Tests...269 MODIFYING

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook

THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook BIOMETRY THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH THIRD E D I T I O N Robert R. SOKAL and F. James ROHLF State University of New York at Stony Brook W. H. FREEMAN AND COMPANY New

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

MODULE 6 LOGISTIC REGRESSION. Module Objectives:

MODULE 6 LOGISTIC REGRESSION. Module Objectives: MODULE 6 LOGISTIC REGRESSION Module Objectives: 1. 147 6.1. LOGIT TRANSFORMATION MODULE 6. LOGISTIC REGRESSION Logistic regression models are used when a researcher is investigating the relationship between

More information

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS

ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS ANALYSIS OF VARIANCE OF BALANCED DAIRY SCIENCE DATA USING SAS Ravinder Malhotra and Vipul Sharma National Dairy Research Institute, Karnal-132001 The most common use of statistics in dairy science is testing

More information

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game. EdPsych/Psych/Soc 589 C.J. Anderson Homework 5: Answer Key 1. Probelm 3.18 (page 96 of Agresti). (a) Y assume Poisson random variable. Plausible Model: E(y) = µt. The expected number of arrests arrests

More information

Inferences About the Difference Between Two Means

Inferences About the Difference Between Two Means 7 Inferences About the Difference Between Two Means Chapter Outline 7.1 New Concepts 7.1.1 Independent Versus Dependent Samples 7.1. Hypotheses 7. Inferences About Two Independent Means 7..1 Independent

More information

Unit 11: Multiple Linear Regression

Unit 11: Multiple Linear Regression Unit 11: Multiple Linear Regression Statistics 571: Statistical Methods Ramón V. León 7/13/2004 Unit 11 - Stat 571 - Ramón V. León 1 Main Application of Multiple Regression Isolating the effect of a variable

More information

Modeling Overdispersion

Modeling Overdispersion James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 1 Introduction 2 Introduction In this lecture we discuss the problem of overdispersion in

More information

Modeling Effect Modification and Higher-Order Interactions: Novel Approach for Repeated Measures Design using the LSMESTIMATE Statement in SAS 9.

Modeling Effect Modification and Higher-Order Interactions: Novel Approach for Repeated Measures Design using the LSMESTIMATE Statement in SAS 9. Paper 400-015 Modeling Effect Modification and Higher-Order Interactions: Novel Approach for Repeated Measures Design using the LSMESTIMATE Statement in SAS 9.4 Pronabesh DasMahapatra, MD, MPH, PatientsLikeMe

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion Biostokastikum Overdispersion is not uncommon in practice. In fact, some would maintain that overdispersion is the norm in practice and nominal dispersion the exception McCullagh and Nelder (1989) Overdispersion

More information

Analysis of 2x2 Cross-Over Designs using T-Tests

Analysis of 2x2 Cross-Over Designs using T-Tests Chapter 234 Analysis of 2x2 Cross-Over Designs using T-Tests Introduction This procedure analyzes data from a two-treatment, two-period (2x2) cross-over design. The response is assumed to be a continuous

More information

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010

MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 Part 1 of this document can be found at http://www.uvm.edu/~dhowell/methods/supplements/mixed Models for Repeated Measures1.pdf

More information

Week 7.1--IES 612-STA STA doc

Week 7.1--IES 612-STA STA doc Week 7.1--IES 612-STA 4-573-STA 4-576.doc IES 612/STA 4-576 Winter 2009 ANOVA MODELS model adequacy aka RESIDUAL ANALYSIS Numeric data samples from t populations obtained Assume Y ij ~ independent N(μ

More information

Paper: ST-161. Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop UMBC, Baltimore, MD

Paper: ST-161. Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop UMBC, Baltimore, MD Paper: ST-161 Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop Institute @ UMBC, Baltimore, MD ABSTRACT SAS has many tools that can be used for data analysis. From Freqs

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Sigmaplot di Systat Software

Sigmaplot di Systat Software Sigmaplot di Systat Software SigmaPlot Has Extensive Statistical Analysis Features SigmaPlot is now bundled with SigmaStat as an easy-to-use package for complete graphing and data analysis. The statistical

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

Box-Cox Transformations

Box-Cox Transformations Box-Cox Transformations Revised: 10/10/2017 Summary... 1 Data Input... 3 Analysis Summary... 3 Analysis Options... 5 Plot of Fitted Model... 6 MSE Comparison Plot... 8 MSE Comparison Table... 9 Skewness

More information

Logistic Regression Analysis

Logistic Regression Analysis Logistic Regression Analysis Predicting whether an event will or will not occur, as well as identifying the variables useful in making the prediction, is important in most academic disciplines as well

More information

Design of Engineering Experiments Part 5 The 2 k Factorial Design

Design of Engineering Experiments Part 5 The 2 k Factorial Design Design of Engineering Experiments Part 5 The 2 k Factorial Design Text reference, Special case of the general factorial design; k factors, all at two levels The two levels are usually called low and high

More information

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY (formerly the Examinations of the Institute of Statisticians) GRADUATE DIPLOMA, 2007

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY (formerly the Examinations of the Institute of Statisticians) GRADUATE DIPLOMA, 2007 EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY (formerly the Examinations of the Institute of Statisticians) GRADUATE DIPLOMA, 2007 Applied Statistics I Time Allowed: Three Hours Candidates should answer

More information

Checking model assumptions with regression diagnostics

Checking model assumptions with regression diagnostics @graemeleehickey www.glhickey.com graeme.hickey@liverpool.ac.uk Checking model assumptions with regression diagnostics Graeme L. Hickey University of Liverpool Conflicts of interest None Assistant Editor

More information

Using SPSS for One Way Analysis of Variance

Using SPSS for One Way Analysis of Variance Using SPSS for One Way Analysis of Variance This tutorial will show you how to use SPSS version 12 to perform a one-way, between- subjects analysis of variance and related post-hoc tests. This tutorial

More information

Count data page 1. Count data. 1. Estimating, testing proportions

Count data page 1. Count data. 1. Estimating, testing proportions Count data page 1 Count data 1. Estimating, testing proportions 100 seeds, 45 germinate. We estimate probability p that a plant will germinate to be 0.45 for this population. Is a 50% germination rate

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

SAS/STAT 14.2 User s Guide. The GENMOD Procedure

SAS/STAT 14.2 User s Guide. The GENMOD Procedure SAS/STAT 14.2 User s Guide The GENMOD Procedure This document is an individual chapter from SAS/STAT 14.2 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute Inc.

More information

STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS

STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS STAT 512 MidTerm I (2/21/2013) Spring 2013 Name: Key INSTRUCTIONS 1. This exam is open book/open notes. All papers (but no electronic devices except for calculators) are allowed. 2. There are 5 pages in

More information

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago Mohammed Research in Pharmacoepidemiology (RIPE) @ National School of Pharmacy, University of Otago What is zero inflation? Suppose you want to study hippos and the effect of habitat variables on their

More information

A CASE STUDY IN HANDLING OVER-DISPERSION IN NEMATODE COUNT DATA SCOTT EDWIN DOUGLAS KREIDER. B.S., The College of William & Mary, 2008 A THESIS

A CASE STUDY IN HANDLING OVER-DISPERSION IN NEMATODE COUNT DATA SCOTT EDWIN DOUGLAS KREIDER. B.S., The College of William & Mary, 2008 A THESIS A CASE STUDY IN HANDLING OVER-DISPERSION IN NEMATODE COUNT DATA by SCOTT EDWIN DOUGLAS KREIDER B.S., The College of William & Mary, 2008 A THESIS submitted in partial fulfillment of the requirements for

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

The GENMOD Procedure. Overview. Getting Started. Syntax. Details. Examples. References. SAS/STAT User's Guide. Book Contents Previous Next

The GENMOD Procedure. Overview. Getting Started. Syntax. Details. Examples. References. SAS/STAT User's Guide. Book Contents Previous Next Book Contents Previous Next SAS/STAT User's Guide Overview Getting Started Syntax Details Examples References Book Contents Previous Next Top http://v8doc.sas.com/sashtml/stat/chap29/index.htm29/10/2004

More information

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p ) Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p. 376-390) BIO656 2009 Goal: To see if a major health-care reform which took place in 1997 in Germany was

More information

SPSS Guide For MMI 409

SPSS Guide For MMI 409 SPSS Guide For MMI 409 by John Wong March 2012 Preface Hopefully, this document can provide some guidance to MMI 409 students on how to use SPSS to solve many of the problems covered in the D Agostino

More information

Topic 8. Data Transformations [ST&D section 9.16]

Topic 8. Data Transformations [ST&D section 9.16] Topic 8. Data Transformations [ST&D section 9.16] 8.1 The assumptions of ANOVA For ANOVA, the linear model for the RCBD is: Y ij = µ + τ i + β j + ε ij There are four key assumptions implicit in this model.

More information

df=degrees of freedom = n - 1

df=degrees of freedom = n - 1 One sample t-test test of the mean Assumptions: Independent, random samples Approximately normal distribution (from intro class: σ is unknown, need to calculate and use s (sample standard deviation)) Hypotheses:

More information

Horse Kick Example: Chi-Squared Test for Goodness of Fit with Unknown Parameters

Horse Kick Example: Chi-Squared Test for Goodness of Fit with Unknown Parameters Math 3080 1. Treibergs Horse Kick Example: Chi-Squared Test for Goodness of Fit with Unknown Parameters Name: Example April 4, 014 This R c program explores a goodness of fit test where the parameter is

More information

Hypothesis Testing for Var-Cov Components

Hypothesis Testing for Var-Cov Components Hypothesis Testing for Var-Cov Components When the specification of coefficients as fixed, random or non-randomly varying is considered, a null hypothesis of the form is considered, where Additional output

More information

Chapter 19: Logistic regression

Chapter 19: Logistic regression Chapter 19: Logistic regression Self-test answers SELF-TEST Rerun this analysis using a stepwise method (Forward: LR) entry method of analysis. The main analysis To open the main Logistic Regression dialog

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA

Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA ABSTRACT Regression analysis is one of the most used statistical methodologies. It can be used to describe or predict causal

More information

Advanced Regression Topics: Violation of Assumptions

Advanced Regression Topics: Violation of Assumptions Advanced Regression Topics: Violation of Assumptions Lecture 7 February 15, 2005 Applied Regression Analysis Lecture #7-2/15/2005 Slide 1 of 36 Today s Lecture Today s Lecture rapping Up Revisiting residuals.

More information

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES BIOL 458 - Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES PART 1: INTRODUCTION TO ANOVA Purpose of ANOVA Analysis of Variance (ANOVA) is an extremely useful statistical method

More information

Introduction to Statistical Data Analysis Lecture 8: Correlation and Simple Regression

Introduction to Statistical Data Analysis Lecture 8: Correlation and Simple Regression Introduction to Statistical Data Analysis Lecture 8: and James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis 1 / 40 Introduction

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

Generalized Linear Models for Count, Skewed, and If and How Much Outcomes

Generalized Linear Models for Count, Skewed, and If and How Much Outcomes Generalized Linear Models for Count, Skewed, and If and How Much Outcomes Today s Class: Review of 3 parts of a generalized model Models for discrete count or continuous skewed outcomes Models for two-part

More information

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Probability Sampling Procedures Collection of Data Measures

More information

Review of Multiple Regression

Review of Multiple Regression Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate

More information

Example 7b: Generalized Models for Ordinal Longitudinal Data using SAS GLIMMIX, STATA MEOLOGIT, and MPLUS (last proportional odds model only)

Example 7b: Generalized Models for Ordinal Longitudinal Data using SAS GLIMMIX, STATA MEOLOGIT, and MPLUS (last proportional odds model only) CLDP945 Example 7b page 1 Example 7b: Generalized Models for Ordinal Longitudinal Data using SAS GLIMMIX, STATA MEOLOGIT, and MPLUS (last proportional odds model only) This example comes from real data

More information

The GENMOD Procedure (Book Excerpt)

The GENMOD Procedure (Book Excerpt) SAS/STAT 9.22 User s Guide The GENMOD Procedure (Book Excerpt) SAS Documentation This document is an individual chapter from SAS/STAT 9.22 User s Guide. The correct bibliographic citation for the complete

More information

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric Assumptions The observations must be independent. Dependent variable should be continuous

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Varieties of Count Data

Varieties of Count Data CHAPTER 1 Varieties of Count Data SOME POINTS OF DISCUSSION What are counts? What are count data? What is a linear statistical model? What is the relationship between a probability distribution function

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Ratio of Polynomials Fit One Variable

Ratio of Polynomials Fit One Variable Chapter 375 Ratio of Polynomials Fit One Variable Introduction This program fits a model that is the ratio of two polynomials of up to fifth order. Examples of this type of model are: and Y = A0 + A1 X

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Introduction to SAS proc mixed

Introduction to SAS proc mixed Faculty of Health Sciences Introduction to SAS proc mixed Analysis of repeated measurements, 2017 Julie Forman Department of Biostatistics, University of Copenhagen 2 / 28 Preparing data for analysis The

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

STAT Checking Model Assumptions

STAT Checking Model Assumptions STAT 704 --- Checking Model Assumptions Recall we assumed the following in our model: (1) The regression relationship between the response and the predictor(s) specified in the model is appropriate (2)

More information

A SAS/AF Application For Sample Size And Power Determination

A SAS/AF Application For Sample Size And Power Determination A SAS/AF Application For Sample Size And Power Determination Fiona Portwood, Software Product Services Ltd. Abstract When planning a study, such as a clinical trial or toxicology experiment, the choice

More information

6 Single Sample Methods for a Location Parameter

6 Single Sample Methods for a Location Parameter 6 Single Sample Methods for a Location Parameter If there are serious departures from parametric test assumptions (e.g., normality or symmetry), nonparametric tests on a measure of central tendency (usually

More information

Chapter 4 Part 3. Sections Poisson Distribution October 2, 2008

Chapter 4 Part 3. Sections Poisson Distribution October 2, 2008 Chapter 4 Part 3 Sections 4.10-4.12 Poisson Distribution October 2, 2008 Goal: To develop an understanding of discrete distributions by considering the binomial (last lecture) and the Poisson distributions.

More information

Group comparisons in logit and probit using predicted probabilities 1

Group comparisons in logit and probit using predicted probabilities 1 Group comparisons in logit and probit using predicted probabilities 1 J. Scott Long Indiana University May 27, 2009 Abstract The comparison of groups in regression models for binary outcomes is complicated

More information