An Approached of Box-Cox Data Transformation to Biostatistics Experiment

Similar documents
The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

Topic 23: Diagnostics and Remedies

Introduction to Linear regression analysis. Part 2. Model comparisons

Lec 3: Model Adequacy Checking

Inference with Simple Regression

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Analysis of Covariance. The following example illustrates a case where the covariate is affected by the treatments.

Checking model assumptions with regression diagnostics

Heteroscedasticity 1

THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook

Non-parametric (Distribution-free) approaches p188 CN

1 Introduction to Minitab

Box-Cox Transformations

a = 4 levels of treatment A = Poison b = 3 levels of treatment B = Pretreatment n = 4 replicates for each treatment combination

Outline. Possible Reasons. Nature of Heteroscedasticity. Basic Econometrics in Transportation. Heteroscedasticity

Topic 8. Data Transformations [ST&D section 9.16]

INFLUENCE OF USING ALTERNATIVE MEANS ON TYPE-I ERROR RATE IN THE COMPARISON OF INDEPENDENT GROUPS ABSTRACT

Institutionen för matematik och matematisk statistik Umeå universitet November 7, Inlämningsuppgift 3. Mariam Shirdel

Lecture 18: Simple Linear Regression

INFERENCE FOR REGRESSION

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

EE290H F05. Spanos. Lecture 5: Comparison of Treatments and ANOVA

Model Building Chap 5 p251

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DESIGN AND ANALYSIS OF EXPERIMENTS Third Edition

Basic Business Statistics 6 th Edition

Statistical. Psychology

Question Possible Points Score Total 100

Lecture 4. Checking Model Adequacy

Analysis of Covariance

A Note on Visualizing Response Transformations in Regression

Conditions for Regression Inference:

Describing distributions with numbers

Fractional Polynomial Regression

Psychology 282 Lecture #4 Outline Inferences in SLR

ORF 245 Fundamentals of Engineering Statistics. Final Exam

Contents. Acknowledgments. xix

A Test of Symmetry. Abdul R. Othman. Universiti Sains Malaysia. H. J. Keselman. University of Manitoba. Rand R. Wilcox

Estadística II Chapter 5. Regression analysis (second part)

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Basic Statistics Exercises 66

STAT 350 Final (new Material) Review Problems Key Spring 2016

Chapter 16. Simple Linear Regression and Correlation

Rule of Thumb Think beyond simple ANOVA when a factor is time or dose think ANCOVA.

Six Sigma Black Belt Study Guides

The ε ij (i.e. the errors or residuals) are normally distributed. This assumption has the least influence on the F test.

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Inferences for Regression

STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS

Assessing Model Adequacy

Objective Experiments Glossary of Statistical Terms

Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables

Outline. Topic 20 - Diagnostics and Remedies. Residuals. Overview. Diagnostics Plots Residual checks Formal Tests. STAT Fall 2013

23. Inference for regression

Part 8: GLMs and Hierarchical LMs and GLMs

An automatic report for the dataset : affairs

Residual Analysis for two-way ANOVA The twoway model with K replicates, including interaction,

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

Design of Experiments SUTD - 21/4/2015 1

Chapter 1 Statistical Inference

Why Data Transformation? Data Transformation. Homoscedasticity and Normality. Homoscedasticity and Normality

SPSS LAB FILE 1

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

sphericity, 5-29, 5-32 residuals, 7-1 spread and level, 2-17 t test, 1-13 transformations, 2-15 violations, 1-19

9 Correlation and Regression

STA 101 Final Review

ESP 178 Applied Research Methods. 2/23: Quantitative Analysis

AMS 7 Correlation and Regression Lecture 8

Ecn Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman. Midterm 2. Name: ID Number: Section:

Optimal Cusum Control Chart for Censored Reliability Data with Log-logistic Distribution

Homework 2: Simple Linear Regression

Introduction to statistical modeling

Design & Analysis of Experiments 7E 2009 Montgomery

Analysis of variance, multivariate (MANOVA)

Hypothesis Testing. Hypothesis: conjecture, proposition or statement based on published literature, data, or a theory that may or may not be true

Applied Regression Modeling: A Business Approach Chapter 3: Multiple Linear Regression Sections

Parameter Estimation and Statistical Test in Modeling Geographically Weighted Poisson Inverse Gaussian Regression

Confidence Interval for the mean response

This document contains 3 sets of practice problems.

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Chapter 16. Simple Linear Regression and dcorrelation

Two-by-two ANOVA: Global and Graphical Comparisons Based on an Extension of the Shift Function

Identification of dispersion effects in replicated two-level fractional factorial experiments

Bivariate data analysis

Practical Statistics for the Analytical Scientist Table of Contents

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective

PART I. (a) Describe all the assumptions for a normal error regression model with one predictor variable,

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Taguchi Method and Robust Design: Tutorial and Guideline

Density Temp vs Ratio. temp

Probability Distributions.

Table 1: Fish Biomass data set on 26 streams

This gives us an upper and lower bound that capture our population mean.

VARIANCE COMPONENT ANALYSIS

Midterm 2 - Solutions

Unit 12: Analysis of Single Factor Experiments

Advanced Quantitative Data Analysis

Transcription:

Statistika, Vol. 6 No., 6 Nopember 6 An Approached of Box-Cox Data Transformation to Biostatistics Experiment Wan Muhamad Amir W Ahmad, Nyi Nyi Naing, Norhayati Rosli Jabatan Matematik, Fakulti Sains dan Teknologi, Malaysia, Kolej Universiti Sains dan Teknologi Malaysia (KUSTEM), Kuala Terengganu, Terengganu Malaysia. Unit Biostatistik, Pusat Pengajian Sains Perubatan, Universiti Sains Malaysia, USM, Kampus Kesihatan, 6 Kubang Kerian, Kelantan, Malaysia. Fakulti Kejuruteraan Kimia dan Sumber Asli, Kolej Universiti Kejuruteraan dan Teknologi Malaysia, Kuantan Pahang, Malaysia. Abstract The Box-Cox family of transformation is a well-known approach to make data behave accordingly to assumption of linear regression and ANOVA. The regression coefficients, as well as the parameter defining the transformation are generally estimated by maximum likelihood, assuming homoscedastic normal error. In application of ANOVA for hypothesis testing in biostatistics science experiments, the assumption of homogeneity of errors often is violating because of scale effects and the nature of the measurements. We demonstrate a method of transformation data so that the assumptions of ANOVA are met (or violated to a lesser degree) and apply it in analysis of data from biostatistics experiments. We will illustrate the use of the Box-Cox method by using MINITAB software. Keywords: Box-Cox transformation and parameter. Introduction Since the seminal by Box and Cox (96), the Box -Cox types of power transformation have generated a great deal of interests, both in theoretical work and in practical applications. The Box-Cox family of transformation has become a widely used tool to make data behave according to a linear regression model. Sakia (99) has given an excellent review of the work relating to this transformation. The response variable, transformed according to the Box-Cox procedure, is usually assumed to be linearly related to its covariates and the errors normally distributed with constant variance. The data that required an ANOVA application, errors need to be independent, have a normal distribution with zero mean, and be similar between treatments. The assumption of homogenous variances often is violated in biological experiments because treatments that result in a change in the mean of a given response variable often are accompanied by changes in error variances (a scale effect). Transformation of the data is usually the useful method of alleviating heterogeneity because it is applicable to all experimental designs analyzed with ANOVA, and conversion of data from interval or ratio values to ranks, as in nonparametric procedures, results in a loss of information. This loss in information is usually reflected by a loss of power of the statistical test. The goal of data transformation is to change the scale by which the data are analyzed so that the variances are not heterogeneous. However, the transformation that is most effective in reducing the heterogeneity of variance often is not obvious and often must be found by trial and error. Box and Cox (96) presented a family of transformations and a computational technique to select a transformation that will best resolve the problems of non-normality and heterogeneity of error. Despite its power and desirable properties, the Box-Cox transformation apparently is rarely used in the statistical analysis of biostatistics data. In this report, we provide an overview of the Box-Cox data transformation and provide an illustrative example for its application in the analysis of data from a biostatistics experiment. The results from the untransformed data and the results from the transform data are discussed.

Wan Muhamad Amir W Ahmad, Nyi Nyi Naing, Norhayati Rosli. Materials and Methods The method presented by Box and Cox (96) is based on the observation that the mean ( of a population such that () is often proportional to the standard deviation ) a (Damon and Harvey, 987; Montgomery, 99). The purpose of transformation is to raise the data to power such that the correlation between the mean and the standard deviation is reduced or eliminated. Box and Cox (96) provided an algorithm by which the optimum value for the transformation parameter.is selected by the method of maximum likelihood. This technique involves performing a series of analyses of variance, for various values of, transformed as y ( ) where, y y y ln y ln y y ln n the geometric mean of the observation (y). The analysis of variance for that yields the lowest error sums of squares then is used for hypothesis testing (Peltier, 998).. Case Study This study is done based on the data that has been used by Box and Cox. We used the Box-Cox data after considering the results of their study which is very useful in general statistics and the used of transformation especially. Although the data that has been studied by Box and Cox is presented in this study case, but we will illustrated it by the different way. Table. shows the dataset of lifetime animal in factorial design of experiment with level of poison factor and level of treatment factor. Table. Dataset of lifetime animal in factorial design of experiment. Diagnostic Checking Poison Treatment A B C D I..8......7.6.88.6.66..7.76.6 II.6.9..6.9.6....9..7....8 III......7..6.8.8....9.. Source : Box and Cox 96 The first step that we should do is plotting the normal probability plot to the response variable. Normal probability plot is done to verify whether the data that we study is fulfilling the assumption of normality. If, there are a sign that show the data is deflecting from the assumption of normality, then the transformation is needed. Figure. illustrates the normal Statistika, Vol. 6, No., Nopember 6

An Approached of Box-Cox Data Transformation to Biostatistics Experiment probability plot of the data. From the normal probability plot, we can see that the normality assumption is not fulfilled by the response variables. The structure of the plot in Figure. displays the deflections in point. Thus, there is enough evidence to say that the normality assumption is contravened in this case. To support this interpretation let us look at the Figure.. From the figure we can see that the Normal Plot of swerves. These results indicate that the residual is not normally distributed. Histogram also shows that the residual is not normally distributed. The right skews in histogram give us the sign that the data is not normally distributed. When the data is normally distributed, the residual should be in a bell shape or nearly bell a shape. Finally, the plot of versus Fit indicates that the variance of error is not homogenous. We can say that the data is not fulfilled the assumption of normal distribution. So, the transformation of the data is required. The plot Box-Cox in Figure. shows that, the value do not include in the 9% of confidence interval. Since these limits do not include the value, we conclude that the transformation is needed. Normal Probability Plot for values 99 9 9 Mean: StDev:.797.89 Percent 8 7 6.. Data. Figure. Normal probability plot of the data. Remedial Measurement From the diagnostics checking techniques, we can conclude that there is a need of transformation. According to the Box-Cox, we have to determine the suitable values of parameter. According to the Figure. the best value for is -. Difference values of in 9% of given confidence interval also can be used for calculation but it is easy to select the value of. The transformation can be employ after we select the value of parameter. Box-Cox (96) mentioned that the value of - for parameter is representing the inverse of transformation. Once the transformation of the data has been done, the normality of the data transformation need to be verified. Figure. given the Normal probability plot of the data after transformation. Statistika, Vol. 6, No., Nopember 6

Wan Muhamad Amir W Ahmad, Nyi Nyi Naing, Norhayati Rosli Normal Plot of s I Chart of s...... -. -. -. - - Normal Score.. -. Observation Number.SL=.968 X=. -.SL=-.968 Histogram of s s vs. Fits Frequency -.-.-............. -. -. -...... Fit.6.7.8 Figure. Model Diagnostics for Data Before Transformation Box-Cox Plot for values StDev... 9% Confidence Interval Last Iteration Info Lambda StDev Low -.68. Est -.6. Up -.6.. - - - - - Lambda Figure. Box-Cox Plot for Values Statistika, Vol. 6, No., Nopember 6

An Approached of Box-Cox Data Transformation to Biostatistics Experiment Normal Probability Plot for values 99 9 9 Mean: StDev:.79.77799 Percent 8 7 6 Data Figure. Normal Probability Plot of the Data After Transformation Figure. Normal probability plot of the data after transformation Normal Plot of s I Chart of s.sl=.9 X=. - - -.SL=-.9 - - Normal Score Observation Number Histogram of s s vs. Fits Frequency -.-. -.7-. -...7 - Fit Figure. Model Diagnostics for Data After Transformation The Normal Plot of is shown in Figure.. From Figure. we can see that this plot is almost linear and this trend explained that it is normally distributed. Plot of Statistika, Vol. 6, No., Nopember 6

6 Wan Muhamad Amir W Ahmad, Nyi Nyi Naing, Norhayati Rosli Histogram illustrate almost bell a shape. This mentioned that the residual is approximating normally distributed as well. Lastly, the plot of versus Fit indicates that the variance of error is homogenous. So, we can conclude that the data transformation is fulfilled the assumption of normal distribution. The adequacy of data transformation, once again will be check by do a Box-Cox plot and it is shown in Figure.6. The Box-Cox plot in Figure.6 demonstrate that there is a value in 9% of confidence interval. Thus, the Box-Cox transformation is accomplishment. Once a Box-Cox transformation has been done, the transformation data can be used in order to employ a parameter analysis. 9% Confidence Interval Last Iteration Info Lambda StDev Low.9.8 Est..8 Up.67.9 StDev - - - - - Lambda Figure.6 Box-Cox Plot for Values After Transformation. Discussion The Box-Cox algorithm provides a simple method to determine the best way to transform the data for reducing heterogeneity of errors. This transformation also is well adapted to bringing heavily skewed data sets to near normality (Draper and Cox 969). The Box -Cox data transformation is a simple method that enable analysis of heteroscedastic and non-normal data sets so that the assumption of the analysis of variance might be satisfied especially when other transformation procedure fail.. Reference [] Andrew, D.F. (97). A Note on the Selection of Data Transformation. Biometrical 8, 9-. [] Anscombe, F.J. (96). Examination of. Proc. Fourth Berkeley Symp. On Mathematical Statistics and Probability, -6. University of California Press, Berkeley. [] Atkinson, A.C. (97). Testing Transformation to Normality. Journal of the Royal Stat. Society Ser. B,, 7-79. [] Atkinson, A.C. (986a). Diagnostics Test for Transformations, Technometrics 8, 9-8 [] Bartlett, M.S. (97) The Use of Transformation. Biometrical 9-. [6] Box, G.E.P., and D.R. Cox (96). An Analysis of Transformation. J.R. Stat Soc. B 6:-. [7] Damon, R. A., Jr., and W. R. Harvey. 987. Experimental Design, ANOVA and Regression. Harper and Row, New York. [8] Draper, N. R., and D.R. Cox (969). On Distribution And Their Transformation to Normality..R. Stat Soc. B :7-76. [9] Montgomery, D. C. 99. Design and Analysis of Experiments (rd Ed.). John Wiley & Sons, New York. [] Peltier, M. R. 996. Effect of melatonin implantation at the summer solstice and ovarian status on the annual reproductive rhythm in pony mares. M.S. Thesis. University of Florida, Gainesville. Statistika, Vol. 6, No., Nopember 6